Abstract



This report reviews the conceptual and theoretical links between Granger causality and directed information
theory. We begin with a short historical tour of Granger causality, concentrating on its closeness to information
theory. The definitions of Granger causality based on prediction are recalled, and the importance of the observation
set is discussed. We present the definitions based on conditional independence. The notion of instantaneous coupling
is included in the definitions. The concept of Granger causality graphs is discussed. We present directed information
theory from the perspective of studies of causal influences between stochastic processes. Causal conditioning appears
to be the cornerstone for the relation between information theory and Granger causality. In the bivariate case, the
fundamental measure is the directed information, which decomposes as the sum of the transfer entropies and a term
quantifying instantaneous coupling. We show the decomposition of the mutual information into the sums of the
transfer entropies and the instantaneous coupling measure, a relation known for the linear Gaussian case. We study
the multivariate case, showing that the useful decomposition is blurred by instantaneous coupling. The links are
further developed by studying how measures based on directed information theory naturally emerge from Granger
causality inference frameworks as hypothesis testing.

Keywords: Granger causality, transfer entropy, information theory, causal conditioning, conditional independence

I. Introduction 

This review deals with the analysis of influences that one system, be it physical, economic, biological or social,
for example, can exert over another. In several scientific fields, finding the influence network between different
systems is crucial. As examples, we can think of gene influence networks [75], [76], relations between economic
variables [29], [80], communication between neurons or the flow of information between different brain regions
[84], or the human influence on the Earth's climate 14T1, [88], and many others.

The context studied in this report is illustrated in figure 1. For a given system, we have at our disposal a number of
different measurements. In neuroscience, these can be local field potentials recorded in the brain of an animal; in
solar physics, these can be solar indices measured by sensors onboard a satellite; in the study of turbulent fluids,
these can be the velocity measured at different scales in the fluid (or, as in the figure, the wavelet analysis of
the velocity at different scales). For these different examples, the aim is to find dependencies between the different
measurements and, if possible, to give a direction to the dependence. In neuroscience, this will help us to understand
how information flows between different areas of the brain; in solar physics, this will help us to understand the links
between indices and their influence on the total solar irradiance received on Earth; in the study of turbulence, this
can confirm the directional cascade of energy from large down to small scales.

Fig. 1. Illustration of the problem of information flow in networks of stochastic processes. Each node of the network is associated with a signal.
Edges between nodes stand for dependence (shared information) between the signals. The dependence can be directed or not. This framework
can be applied to different situations such as solar physics, neuroscience or the study of turbulence in fluids, as illustrated by the three examples
depicted here.

In a graphical modeling approach, each signal is associated with a particular node of a graph, and dependencies are
represented by edges, directed if a directional dependence exists. The questions addressed in this paper concern the
assessment of directional dependence between signals, and thus concern the inference problem of estimating the
edge set in the graph of signals considered.

Climatology and neuroscience were already given as examples by Norbert Wiener in 1956 [95], a paper which
inspired the econometrician Clive Granger to develop what is now termed Granger causality [32]. Wiener proposed in this
paper that a signal x causes another time series y if the past of x has a strictly positive influence on the quality
of prediction of y. Let us quote Wiener [95]:

"As an application of this, let us consider the case where $f_1(\alpha)$ represents the temperature at 9 A.M.
in Boston and $f_2(\alpha)$ represents the temperature at the same time in Albany. We generally suppose that
weather moves from west to east with the rotation of the earth; the two quantities 1 - C and its correlate
in the other direction will enable us to make a precise statement containing some of this content and
then verify whether this statement is true or not. Or again, in the study of brain waves we may be able
to obtain electroencephalograms more or less corresponding to electrical activity in different part of the
brain. Here the study of coefficients of causality running both ways and of their analogues for sets of
more than two functions $f$ may be useful in determining what part of the brain is driving what other part
of the brain in its normal activity."
In a wide sense, Granger causality can be summed up as a theoretical framework based on conditional independence
to assess directional dependencies between time series. It is interesting to note that Norbert Wiener influenced
Granger causality, as well as another field dedicated to the analysis of dependencies: information theory. Information
theory has led to the definition of quantities that measure the uncertainty on variables using probabilistic concepts.
Furthermore, this has led to the definition of measures of dependence based on the decrease in uncertainty relating
to one variable after observing another one. Usual information theory is, however, symmetrical. For example, the
well-known mutual information rate between two stationary time series is symmetrical under an exchange of the
two signals: the mutual information assesses the undirected dependence. Directional dependence analysis viewed
as an information-theoretic problem requires the breaking of the usual symmetry of information theory. This was
realized in the 1960's and early 1970's by Hans Marko, a German professor of communication. He developed the
bidirectional information theory in the Markov case [57]. This theory was later generalized by James Massey and
Gerhard Kramer, to what we may now call directed information theory [59], B31.

It is the aim of this report to review the conceptual and theoretical links between Granger causality and directed 
information theory. 

Many information-theoretic tools have been designed for the practical implementation of Granger causality ideas. 
We will not show all of the different measures proposed, because they are almost always particular cases of 
the measures issued from directed information theory. Furthermore, some measures might have been proposed in 
different fields (and/or at different periods of time) and have received different names. We will only consider the 
well-accepted names. This is the case, for example, of 'transfer entropy', as coined by Schreiber in 2000 [81], but
which appeared earlier under different names, in different fields, and might be considered under slightly different 
hypotheses. Prior to developing a unified view of the links between Granger causality and information theory, we 
will provide a survey of the literature, concentrating on studies where information theory and Granger causality are 
jointly presented. 

Furthermore, we will not review any practical aspects, nor any detailed applications. In this spirit, this report
is different from [36], which concentrated on the estimation of information quantities, and where the review is
restricted to transfer entropy. For reviews on the analysis of dependencies between systems and for applications of
Granger causality in neuroscience, we refer to [67], [25]. We will, however, mention some important practical points
in our conclusions, where we will also discuss some current and future directions of research in the field.
A. What is, and what is not, Granger causality

We will not debate the meaning of causality or causation. We instead refer to [69]. However, we must emphasize
that Granger causality actually measures a statistical dependence between the past of a process and the present of
another. In this respect, the word causality in Granger causality takes on the usual meaning that a cause occurs
prior to its effect. However, nothing in the definitions that we will recall precludes that signal x can simultaneously
be Granger caused by y and be a cause of y. This lies in the very close connection between Granger causality and
the feedback between time series.

Granger causality is based on the usual concept of conditioning in probability theory, whereas approaches
developed for example in [69], BP relied on causal calculus and the concept of intervention. In this spirit,
intervention is closer to experimental sciences, where we imagine that we can really, for example, freeze some
system and measure the influence of this action on another process. It is now well-known that causality in this
sense between random variables can be inferred unambiguously only in restricted cases, such as directed acyclic graph
models [94], [50], [69], BP. In the Granger causality context, there is no such ambiguity and restriction.

B. A historical viewpoint 

In his Nobel prize lecture in 2003, Clive W. Granger mentioned that in 1959, Denis Gabor pointed out the work
of Wiener to him, as a hint to solve some of the difficulties he met in his work. Norbert Wiener's paper is about the
theory of prediction [95]. At the end of his paper, Wiener proposed that prediction theory could be used to define
causality between time series. Granger further developed this idea, and came up with a definition of causality and
testing procedures [28], [29].

In these studies, the essential stones were laid. Granger's causality states that a cause must occur before the 
effect, and that causality is relative to the knowledge that is available. This last statement deserves some comment. 
When testing for causality of one variable on another, it is assumed that the cause has information about the effect 
that is unique to it; i.e. this information is unknown to any other variable. Obviously, this cannot be verified for 
variables that are not known. Therefore, the conclusion drawn in a causal testing procedure is relative to the set 
of measurements that are available. A conclusion reached based on a set of measurements can be altered if new 
measurements are taken into account. 

Mention of information theory is also present in the studies of Granger. In the restricted case of two Gaussian
signals, Granger already noted the link between what he called the 'causality indices' and the mutual information
(Eq. 5.4 in [28]). Furthermore, he already foresaw the generalization to the multivariate case, as he wrote in the
same paper:

"In the case of q variables, similar equations exist if coherence is replaced by partial coherence, and a
new concept of 'partial information' is introduced."
Granger's paper in 1969 does not contain much new information, but rather, it gives a refined presentation of the 
concepts. 
During the 1970's, some studies, e.g. [80], ifTTl, 113511, appeared that generalized Granger's work along some
directions, and related some of the applications to economics. In the early 1980's, several studies were
published that established the now accepted definitions of Granger causality [30], |fl9l, 1121. These are natural
extensions of the ideas built upon prediction, and they rely on conditional independence. Finally, the recent studies
of Dahlhaus and Eichler allowed the definitions of Granger causality graphs [14], IfTTl, [18]. These studies provide
a counterpart of graphical models of multivariate random variables to multivariate stochastic processes.

In two studies published in 1982 and 1984 [22], [23], Geweke, another econometrician, set up a full treatment of
Granger causality testing for the Gaussian case, which included the idea of feedback and instantaneous coupling. In
[22], the study was restricted to the link between two time series (possibly multidimensional). In this study, Geweke
defined an index of causality from x to y: it is the logarithm of the ratio of the asymptotic mean square error when
predicting y from its past only, to the asymptotic mean square error when predicting y from its past and from the 
past of x. Geweke also defined the same kind of index for instantaneous coupling, and showed, remarkably, that 
the mutual information rate between x and y decomposes as the sum of the indices of causality from x to y and 
from y to x with the index of instantaneous coupling. This decomposition was shown in the Gaussian case, and it 
remains valid in any case when the indices of causality are replaced by transfer entropy rates, and the instantaneous 
coupling index is replaced by an instantaneous information exchange rate. This link between Granger causality and 
directed information theory was further supported by [4], [9] (without mention of instantaneous coupling in [9]),
and the generalization to the nonGaussian case by ID (see also [74] for related results). However, prior to these
recent studies, the generalization of Geweke's idea to some general setting was reported in 1987, in econometrics
by Gourieroux et al. [26], and in engineering by Rissanen and Wax [77]. Gourieroux and his co-workers considered
a joint Markovian representation of the signals, and worked in a decision-theoretic framework. They defined a 
sequence of nested hypotheses, whether causality was true or not, instantaneous coupling was present or not. They 
then worked out the decision statistics using the Kullback approach to decision theory [48], in which discrepancies
between hypotheses are measured according to the Kullback divergence between the probability measures under
the hypotheses involved. In this setting, the decomposition obtained by Geweke in the Gaussian case was evidently
generalized. In [77], the approach taken was closer to Geweke's study, and it relied on system identification, in
which the complexity of the model was taken into account. The probability measures were parameterized, and an 
information measure that jointly assessed the estimation procedure and the complexity of the model was used when 
predicting a signal. This allowed Geweke's result to be extended to nonlinear modeling (and hence the nonGaussian 
case), and provided an information-theoretic interpretation of the tests. Once again, the same kind of decomposition 
of dependence was obtained by these authors. We will see in section III that the decomposition holds due to Kramer's
causal conditioning. These studies were limited to the bivariate case [26], [77].
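
As a rough numerical illustration of the index of causality described above, the following Python sketch estimates it for a pair of scalar signals by comparing in-sample least-squares prediction errors. The function names (`lstsq_risk`, `geweke_index`), the finite model order and the simulated model are assumptions made for illustration only; Geweke's definition relies on asymptotic mean square errors, which this finite-sample estimate only approximates.

```python
import numpy as np

def lstsq_risk(target, regressors):
    """Empirical mean square error of the best linear predictor (with intercept)."""
    X = np.column_stack([np.ones(len(target))] + list(regressors))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.mean((target - X @ coef) ** 2))

def geweke_index(cause, effect, order=1):
    """log(MSE using own past only / MSE using own past and past of `cause`)."""
    n = len(effect)
    target = effect[order:]
    own = [effect[order - d:n - d] for d in range(1, order + 1)]
    other = [cause[order - d:n - d] for d in range(1, order + 1)]
    return float(np.log(lstsq_risk(target, own) / lstsq_risk(target, own + other)))

# Hypothetical test model: x is white noise and drives y with a one-step delay.
rng = np.random.default_rng(0)
n = 50_000
x = rng.standard_normal(n)
w = rng.standard_normal(n)
y = np.zeros(n)
for k in range(n - 1):
    y[k + 1] = 0.5 * y[k] + 0.8 * x[k] + w[k + 1]

f_xy = geweke_index(x, y)  # clearly positive: the past of x helps to predict y
f_yx = geweke_index(y, x)  # close to zero: the past of y does not help to predict x
print(f_xy, f_yx)
```

For this model, the error variance when predicting y from its own past is 1 + 0.8^2 = 1.64 and drops to 1 when the past of x is included, so the estimated index from x to y should be close to log(1.64).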

In the late 1990's, some studies began to develop in the physics community on influences between dynamical 
systems. A first route was taken that followed the ideas of dynamic system studies for the prediction of chaotic 
systems. To determine if one signal influenced another, the idea was to consider each of the signals as measured 
states of two different dynamic systems, and then to study the master-slave relationships between these two systems 
(for examples, see [89], [72]). The dynamics of the systems was built using phase space reconstruction [40]. The
influence of one system on another was then defined by making a prediction of the dynamics in the reconstructed 
phase space of one of the processes. To our knowledge, the setting was restricted to the bivariate case. A second 
route, which was also restricted to the bivariate case, was taken and relied on information-theoretic tools. The main 
contributions were from Palus and Schreiber [81], [65], with further developments appearing some years later [37],
[66], [20]. In these studies, the influence of one process on the other was measured by the discrepancy between the
probability measures under the hypotheses of influence or no influence. Naturally, the measures defined very much
resembled the measures proposed by Gourieroux et al. [26], and used the concept of conditional mutual information.
The measure to assess whether one signal influences the other was termed transfer entropy by Schreiber. Its definition
was proposed under a Markovian assumption, exactly as was done in [26]. The presentation by Palus [65] was more
direct and was not based on a decision-theoretic idea. The measure defined is, however, equivalent to the transfer 
entropy. Interestingly, Palus noted in this 2001 paper the closeness of the approach to Granger causality, as per the 
quotation: 

"the [latter] measure can also be understood as an information theoretic formulation of the Granger 
causality concept." 

Note that most of these studies considered bivariate analysis, with the notable exception of [20], in which the 
presence of side information (other measured time series) was explicitly considered.

In parallel with these studies, many others were dedicated to the implementation of Granger causality testing in 
fields as diverse as climatology (with applications to the controversial questions of global warming) and neuroscience; 
see ED, ED, BE), QDD, G3, CD, (25), (64), to cite but a few. 

In a very different field, information theory, the problem of feedback has led to many questions since the
1950's. We will not review or cite anything on the problem created by feedback in information theory as this is 
not within the scope of the present study, but some information can be found in [13]. Instead, we will concentrate 
on studies that are directly related to the subject of this review. A major breakthrough was achieved by James 
Massey in 1990 in a short conference paper [59]. Following the (lost?) ideas of Marko on bidirectional information
theory that were developed in the Markovian case [57], Massey re-examined the usual definition of what is called
a discrete memoryless channel in information theory, and he showed that the usual definition based on some 
probabilistic assumptions prohibited the use of feedback. He then clarified the definition of memory and feedback 
in a communication channel. As a consequence, he showed that in a general channel used with feedback, the usual 
definition of capacity that relies on mutual information was not adequate. Instead, the right measure was shown to 
be directed information, an asymmetrical measure of the flow of information. These ideas were further examined 
by Kramer, who introduced the concept of causal conditioning, and who developed the first applications of directed 
information theory to communication in networks B31 . After some years, the importance of causal conditioning 
for the analysis of communication in systems with feedback was realized. Many studies were then dedicated to 
the analysis of the capacity of channels with feedback and the dual problem of rate-distortion theory [86], [87],
[91], [42]. Due to the rapid development in the study of networks (e.g., social networks, neural networks) and of
the related connectivity problem, more recently many authors made connections between information theory and
Granger causality 0, [83], [4], [5], [9], [6], [74]. Some of these studies were restricted to the Gaussian case, and to
the bivariate case. Most of these studies did not tackle the problem of instantaneous coupling. Furthermore, several 
authors realized the importance of directed information theory to assess the circulation of information in networks 

ID, E3, ED- 

C. Outline 

Tools from directed information theory appear as natural measures to assess Granger causality. Although Granger 
causality can be considered as a powerful theoretical framework to study influences between signals mathematically, 
directed information theory provides the measures to test theoretical assertions practically. As already mentioned, 
these measures are transfer entropy (and its conditional versions), which assesses the dynamical part of Granger 
causality, and instantaneous information exchange (and its conditional versions), which assesses instantaneous 
coupling. 

This review is structured as follows. We will first give an overview of the definitions of Granger causality.
These are presented in a multivariate setting. We go gradually from weak definitions based on prediction, to strong
definitions based on conditional independence. The problem of instantaneous coupling is then discussed, and we
show that there are two possible definitions for it. Causality graphs (after Eichler [18]) provide particular reasons to
prefer one of these definitions. Section III introduces an analysis of Granger causality from an information-theoretic
perspective. We insist on the concept of causal conditioning, which is at the root of the relationship studied. Section
IV then highlights the links. Here, we first restate the definitions of Granger causality using concepts from directed
information theory. Then, from a different point of view, we show how conceptual inference approaches lead
to the measures defined in directed information theory. The review then closes with a discussion of some of the
aspects that we intentionally do not present here, and of some lines for further research.

D. Notations 

All of the random variables, vectors and signals considered here are defined in a common probability space
$(\Omega, \mathcal{B}, P)$. They take values either in $\mathbb{R}$ or $\mathbb{R}^d$, $d$ being some strictly positive integer, or they can even take
discrete values. As we concentrate on conceptual aspects rather than technical aspects, we assume that the variables 
considered are 'well behaved'. In particular, we assume finiteness of moments of sufficient order. We assume that 
continuously valued variables have a measure that is absolutely continuous with respect to the Lebesgue measure 
of the space considered. Hence, the existence of probability density functions is assumed. Limits are supposed to 
exist when needed. All of the processes considered in this report are assumed to be stationary. 

We work with discrete time. A signal will generically be denoted as $x(k)$. This notation also stands for the value
of the signal at time $k$. The collection of successive samples of the signal, $x_k, x_{k+1}, \ldots, x_{k+n}$, will be denoted as
$x_k^{k+n}$. Often, an initial time will be assumed. This can be 0, 1, or $-\infty$. In any case, if we collect all of the samples
of the signal from the initial time up to time $n$, we will suppress the lower index and write this collection as $x^n$.

When dealing with multivariate signals, we use a graph-theoretic notation. This will simplify some connections
with graphical modeling. Let $V$ be an index set of finite cardinality $|V|$. $x_V = \{x_V(k), k \in \mathbb{Z}\}$ is a $d$-dimensional
discrete time stationary multivariate process for the probability space considered. For $a \in V$, $x_a$ is the corresponding
component of $x_V$. Likewise, for any subset $A \subset V$, $x_A$ is the corresponding multivariate process $(x_{a_1}, \ldots, x_{a_{|A|}})$.
We say that subsets $A, B, C$ form a partition of $V$ if they are disjoint and if $A \cup B \cup C = V$. The information
obtained by observing $x_A$ up to time $k$ is summarized by the filtration generated by $\{x_A(l), \forall l \le k\}$. This is denoted
as $x_A^k$. Furthermore, we will often identify $x_A$ with $A$ in the discussion.

The probability density functions (p.d.f.) or probability mass functions (p.m.f.) will be denoted by the same
notation as $p(x_A^n)$. The conditional p.d.f. and p.m.f. are written as $p(x_A^n \mid x_B^m)$. The expected value is denoted as
$E[\cdot]$, $E_x[\cdot]$ or $E_p[\cdot]$ if we want to specify which variable is averaged, or under which probability measure the
expected value is evaluated.

Independence between random variables and vectors $x$ and $y$ will be denoted as $x \perp\!\!\!\perp y$, while conditional
independence given $z$ will be written as $x \perp\!\!\!\perp y \mid z$.

II. Granger's causality 

The early definitions followed the ideas of Wiener: A signal x causes a signal y if the past of x helps in the 
prediction of y. Implementing this idea requires the performing of the prediction and the quantification of its 
quality. This leads to a weak, but operational, form of the definitions of Granger causality. The idea of improving 
a prediction is generalized by encoding it into conditional dependence or independence. 

A. From prediction-based definitions...

Consider a cost function $g : \mathbb{R}^k \to \mathbb{R}$ ($k$ is some appropriate dimension), and the associated risk $E[g(e)]$,
where $e$ stands for an error term. Let a predictor of $x_B(n+1)$ be defined formally as $\hat{x}_B(n+1) = f(x_A^n)$, where
$A$ and $B$ are subsets of $V$, and $f$ is a function between appropriate spaces, chosen to minimize the risk with
$e(n) := x_B(n+1) - \hat{x}_B(n+1)$. Solvability may be granted if $f$ is restricted to an element of a given class of
functions, such as the set of linear functions. Let $\mathcal{F}$ be such a function class. Define:

$$R_{\mathcal{F}}(B(n+1) \mid A^n) = \inf_{f \in \mathcal{F}} E\left[ g\left( x_B(n+1) - f(x_A^n) \right) \right] \qquad (1)$$

$R_{\mathcal{F}}(B(n+1) \mid A^n)$ is therefore the optimal risk when making a one-step-ahead prediction of the multivariate signal
$x_B$ from the past samples of the multivariate signal $x_A$. We are now ready to measure the influence of the past of
a process on the prediction of another. To be relatively general and to prepare comments on the structure of the
graph, this can be done for subsets of $V$. We thus choose $A$ and $B$ to be two disjoint subsets of $V$, and we define
$C := V \setminus (A \cup B)$ (we use $\setminus$ to mean subtraction of a set). We study causality from $x_A$ to $x_B$ by measuring the
decrease in the quality of the prediction of $x_B(n)$ when excluding the past of $x_A$.
Let $R_{\mathcal{F}}(B(n+1) \mid V^n)$ be the optimal risk obtained for the prediction of $x_B$ from the past of all of the signals
grouped in $x_V$. This risk is compared to $R_{\mathcal{F}}(B(n+1) \mid (V \setminus A)^n)$, where the past of $x_A$ is omitted. Then, for the
usual cost functions, we necessarily have:

$$R_{\mathcal{F}}(B(n+1) \mid V^n) \le R_{\mathcal{F}}(B(n+1) \mid (V \setminus A)^n) \qquad (2)$$

A natural first definition for Granger causality is:

Definition 1. $x_A$ does not Granger cause $x_B$ relative to $V$ if and only if $R_{\mathcal{F}}(B(n+1) \mid V^n) = R_{\mathcal{F}}(B(n+1) \mid (V \setminus A)^n)$.

This definition of Granger causality depends on the cost $g$ chosen as well as on the class $\mathcal{F}$ of the functions
considered. Usually, a quadratic cost function is chosen, for its simplicity and for its evident physical interpretation
(a measure of the power of the error). The choice of the class of functions $\mathcal{F}$ is crucial. The result of the causality test
in definition 1 can change when the class is changed. Consider the very simple example of $x_{n+1} = \alpha x_n + \beta y_n^2 + \varepsilon_{n+1}$,
where $y_n$ and $\varepsilon_n$ are zero-mean Gaussian independent and identically distributed (i.i.d.) sequences that are independent of each
other. The covariance between $x_{n+1}$ and $y_n$ is zero, and using the quadratic loss and the class of linear functions,
we conclude that $y$ does not Granger cause $x$, because using a linear function of $x_n, y_n$ to predict $x$ would lead to
the same minimal risk as using a linear function of $x_n$ only. However, $y_n$ obviously causes $x_n$, but in a nonlinear
setting.
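
The dependence of definition 1 on the function class can be checked numerically. The sketch below simulates the example above, reading the coupling as quadratic ($\beta y_n^2$, which is the assumption that makes the covariance with $y_n$ vanish), and compares the empirical quadratic risks of three linear-in-parameters predictors. The helper `lstsq_risk`, the parameter values and the sample size are illustrative choices, not part of the original text.

```python
import numpy as np

def lstsq_risk(target, regressors):
    """Empirical quadratic risk of the best linear predictor (with intercept)."""
    X = np.column_stack([np.ones(len(target))] + list(regressors))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.mean((target - X @ coef) ** 2))

def simulate(n=100_000, alpha=0.5, beta=1.0, seed=0):
    """x[k+1] = alpha * x[k] + beta * y[k]**2 + eps[k+1], with y and eps white Gaussian."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n)
    eps = rng.standard_normal(n)
    x = np.zeros(n)
    for k in range(n - 1):
        x[k + 1] = alpha * x[k] + beta * y[k] ** 2 + eps[k + 1]
    return x, y

x, y = simulate()
target, x_past, y_past = x[1:], x[:-1], y[:-1]

risk_x_only = lstsq_risk(target, [x_past])                       # past of x only
risk_linear = lstsq_risk(target, [x_past, y_past])               # adds a linear term in y
risk_quadratic = lstsq_risk(target, [x_past, y_past, y_past ** 2])  # adds y squared

print(risk_x_only, risk_linear, risk_quadratic)
```

With the linear class, the two first risks are nearly identical, so y is declared non-causal for x; once the square of y is allowed as a regressor, the risk drops markedly and the influence of y becomes visible.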

The definition is given using the negative of the proposition. In the positive case, i.e., $R_{\mathcal{F}}(B(n+1) \mid V^n) <
R_{\mathcal{F}}(B(n+1) \mid (V \setminus A)^n)$, Granger proposes to say that $x_A$ is a prima facie cause of $x_B$ relative to $V$; prima facie
can be translated as 'at first glance'. This is used to insist that if $V$ is enlarged by including other measurements,
then the conclusion might be changed. This can be seen as redundant with the mention of the relativity to the
observation set $V$, and we therefore do not use this terminology. However, a mention of the relativity to $V$ must
be used, as modification of this set can alter the conclusion. A very simple example of this situation is the chain
$x_n \to y_n \to z_n$, where, for example, $x_n$ is an i.i.d. sequence, $y_{n+1} = x_n + \varepsilon_{n+1}$, $z_{n+1} = y_n + \eta_{n+1}$, $\varepsilon_n, \eta_n$ being
independent i.i.d. sequences. Relative to $V = \{x, z\}$, $x$ causes $z$ if we use the quadratic loss and linear functions
of the past samples of $x$ (note here that the predictor of $z_{n+1}$ must be a function of not only $x_n$, but also of $x_{n-1}$).
However, if we include the past samples of $y$ and $V = \{x, y, z\}$, then the quality of the prediction of $z$ does not
deteriorate if we do not use past samples of $x$. Therefore, $x$ does not cause $z$ relative to $V = \{x, y, z\}$.
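
The chain example can be reproduced with a short simulation. The following sketch is a minimal illustration under assumed choices (unit-variance Gaussian noises, two lags per signal, least-squares linear predictors); the helper names `lstsq_risk` and `lags` are hypothetical.

```python
import numpy as np

def lstsq_risk(target, regressors):
    """Empirical quadratic risk of the best linear predictor (with intercept)."""
    X = np.column_stack([np.ones(len(target))] + list(regressors))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.mean((target - X @ coef) ** 2))

def lags(s):
    """Samples at times k and k-1, aligned with a target taken at time k+1."""
    return [s[1:-1], s[:-2]]

# Chain x -> y -> z: y[k+1] = x[k] + eps[k+1], z[k+1] = y[k] + eta[k+1].
rng = np.random.default_rng(1)
n = 100_000
x = rng.standard_normal(n)
eps = rng.standard_normal(n)
eta = rng.standard_normal(n)
y = np.zeros(n)
z = np.zeros(n)
y[1:] = x[:-1] + eps[1:]
z[1:] = y[:-1] + eta[1:]

target = z[2:]

# Observation set V = {x, z}: removing the past of x degrades the prediction of z.
risk_z_only = lstsq_risk(target, lags(z))
risk_xz = lstsq_risk(target, lags(z) + lags(x))

# Observation set V = {x, y, z}: removing the past of x changes almost nothing.
risk_yz = lstsq_risk(target, lags(z) + lags(y))
risk_xyz = lstsq_risk(target, lags(z) + lags(y) + lags(x))

print(risk_z_only, risk_xz, risk_yz, risk_xyz)
```

The gap between the two first risks indicates that x causes z relative to {x, z}, while the two last risks coincide up to estimation error, so x does not cause z relative to {x, y, z}.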

The advantage of the prediction-based definition is that it leads to operational tests. If the quadratic loss is
chosen, working in a parameterized class of functions, such as linear filters or Volterra filters, or even working
in reproducing kernel Hilbert spaces, allows the implementation of the definition [58], [7], [8]. In such cases, the
test needed can be evaluated efficiently from the data. From a theoretical point of view, the quadratic loss can be
used to find the optimal function in a much wider class of functions: the measurable functions. In this class, the
optimal function for the quadratic loss is widely known to be the conditional expectation [52]. When predicting
$x_B$ from the whole observation set $V$, the optimal predictor is written as $\hat{x}_B(n+1) = E[x_B(n+1) \mid x_V^n]$. Likewise,
elimination of $A$ from $V$ to study its influence on $B$ leads to the predictor $\hat{x}_B(n+1) = E[x_B(n+1) \mid x_B^n, x_C^n]$, where
$V = C \cup A \cup B$. These estimators are of little use, because they are too difficult, or even impossible, to compute.
However, they highlight the importance of the conditional distributions $p(x_B(n+1) \mid x_V^n)$ and $p(x_B(n+1) \mid x_B^n, x_C^n)$ in
the problem of testing whether $x_A$ Granger causes $x_B$ relative to $V$ or not.

B. ...to a probabilistic definition

The optimal predictors studied above are equal if the conditional probability distributions $p(x_B(n+1) \mid x_V^n)$ and
$p(x_B(n+1) \mid x_B^n, x_C^n)$ are equal. These distributions are identical if and only if $x_B(n+1)$ and $x_A^n$ are independent
conditionally to $x_B^n, x_C^n$. A natural extension of definition 1 relies on the use of conditional independence. Once
again, let $A \cup B \cup C$ be a partition of $V$.

Definition 2. $x_A$ does not Granger cause $x_B$ relative to $V$ if and only if $x_B(n+1) \perp\!\!\!\perp x_A^n \mid x_B^n, x_C^n$, $\forall n \in \mathbb{Z}$.

This definition means that conditionally to the past of $x_C$, the past of $x_A$ does not bring more information about
$x_B(n+1)$ than is contained in the past of $x_B$.

Definition 2 is far more general than definition 1. If $x_A$ does not Granger cause $x_B$ relative to $V$ in the sense
of definition 2, then it also does not in the sense of definition 1 when the function class contains all measurable
functions, because the optimal predictors then coincide. Moreover, definition 2 does not rely on any function class
or on any cost function. However, it lacks an inherent operational character: the tools to evaluate conditional
independence remain to be defined. The assessment of conditional independence can be achieved using measures of 
conditional independence, and some of these measures will be the cornerstone to link directed information theory 
and Granger causality. 

Note also that the concept of causality in this definition is again a relative concept, and that adding or deleting 
data from the observation set V might modify the conclusions. 

C. Instantaneous coupling 

The definitions given so far concern the influence of the past of one process on the present of another one. This 
is one reason that justifies the use of the term 'causality', when the definitions are actually based on statistical 
dependence. For an extensive discussion on the differences between causality and statistical dependence, we refer 
to [69].

There is another influence between the processes that is not taken into account by definitions 1 and 2. This influence
is referred to as 'instantaneous causality' [28], [30]. However, we will use our preferred term of 'instantaneous
coupling', specifically to insist that it is not equivalent to a causal link per se, but actually a statistical dependence
relationship. The term 'contemporaneous conditional independence' that is used in [18] could also be chosen.

Instantaneous coupling measures the common information between $x_A(n+1)$ and $x_B(n+1)$ that is not shared with
their past. A definition of instantaneous coupling might then be that $x_A(n+1)$ and $x_B(n+1)$ are not instantaneously
coupled if $x_A(n+1) \perp\!\!\!\perp x_B(n+1) \mid x_A^n, x_B^n$, $\forall n$. This definition makes perfect sense if the observation set is reduced
to A and B, a situation we refer to as the bivariate case. However, in general, there is also side information C, and
the definition must include this knowledge. The presence of side information then leads to two possible
definitions of instantaneous coupling.

Definition 3. $x_A$ and $x_B$ are not conditionally instantaneously coupled relative to V if and only if
$x_A(n+1) \perp\!\!\!\perp x_B(n+1) \mid x_A^n, x_B^n, x_C^{n+1}$, $\forall n \in \mathbb{Z}$, where $A \cup B \cup C$ is a partition of V.

The second possibility is the following: 

Definition 4. $x_A$ and $x_B$ are not instantaneously coupled relative to V if and only if
$x_A(n+1) \perp\!\!\!\perp x_B(n+1) \mid x_A^n, x_B^n, x_C^n$, $\forall n \in \mathbb{Z}$.

Note that definitions 3 and 4 are symmetrical in A and B (the application of Bayes theorem). The difference
between definitions 3 and 4 resides in the conditioning on $x_C^{n+1}$ instead of $x_C^n$.

If the side information up to time n is considered only as in definition 4, the instantaneous dependence or
independence is not conditional on the presence of the remaining nodes in C. Thus, this coupling is a bivariate
instantaneous coupling: it does measure instantaneous dependence (or independence) between A and B without
considering the possible instantaneous coupling between either A and C or B and C. Thus, instantaneous coupling
found with definition 4 between A and B does not preclude the possibility that the coupling is actually due to
couplings between A and C and/or B and C.

Inclusion of all of the information up to time n + 1 in the conditioning variables allows the dependence or
independence to be tested between $x_A(n+1)$ and $x_B(n+1)$ conditionally to $x_C(n+1)$.

We end up here with the same differences as those between correlation and partial correlation, or between dependence and
conditional dependence for random variables. In graphical modeling, the usual graphs are based on conditional
independence between variables [94], [50]. These conditional independence graphs are preferred to independence
graphs because of their geometrical properties (e.g., d-separation [69]), which match the Markov properties possibly
present in the multivariate distribution they represent. From a physical point of view, conditional independence might
be preferable, specifically to eliminate 'false' coupling due to third parties. In this respect, conditional independence
is not a panacea, as independent variables can be conditionally dependent. The well-known example is the
conditional coupling of independent x and y by their addition. Indeed, even if independent, x and y are dependent
conditionally to z = x + y.
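
The addition example can be checked numerically. The following Python sketch is only an illustration: it assumes that x and y are independent fair bits (a choice made here for concreteness), and the helper names `entropy` and `cond_mutual_info` are hypothetical. It computes I(x; y) and I(x; y | z) for z = x + y directly from the joint probability table.

```python
import itertools
import math

def entropy(pmf, idx):
    """Shannon entropy (in bits) of the marginal of `pmf` on coordinates `idx`."""
    marg = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def cond_mutual_info(pmf, a, b, c=()):
    """I(a; b | c), written as a combination of joint entropies."""
    a, b, c = list(a), list(b), list(c)
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))

# Assumed toy model: x, y independent fair bits, z = x + y; outcome = (x, y, z).
pmf = {(x, y, x + y): 0.25 for x, y in itertools.product((0, 1), repeat=2)}

mi_xy = cond_mutual_info(pmf, [0], [1])               # I(x; y)
cmi_xy_given_z = cond_mutual_info(pmf, [0], [1], [2])  # I(x; y | z)
print(mi_xy, cmi_xy_given_z)
```

Under these assumptions the unconditional mutual information vanishes while the conditional one equals half a bit, which is the conditional coupling described above.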

D. More on graphs 

Granger causality graphs were defined and studied in [18]. A causality graph is a mixed graph $(V, E_d, E_u)$ that
encodes Granger causality relationships between the components of $x_V$. The vertex set V stores the indexes of
the components of $x_V$. $E_d$ is a set of directed edges between vertices. A directed edge from a to b is equivalent
to "$x_a$ Granger causes $x_b$ relatively to V". $E_u$ is a set of undirected edges. An undirected edge between $x_a$
and $x_b$ is equivalent to "$x_a$ and $x_b$ are (conditionally, if definition 3 is adopted) instantaneously coupled". Interestingly, a
Granger causality graph may have Markov properties (as in usual graphical models) reflecting a particular (spatial)
structure of the joint probability distribution of the whole process $\{x_V\}$ [18]. A taxonomy of Markov properties
(local, global, block recursive) is studied in [18], and equivalence between these properties is put forward. More
interestingly, these properties are linked with topological properties of the graph. Therefore, structural properties
of the graphs are equivalent to a particular factorization of the joint probability of the multivariate process. We
will not continue on this subject here, but this must be known since it paves the way to more efficient inference
methods for Granger graphical modeling of multivariate processes (see a first step in this direction in [73]).

III. Directed information theory and directional dependence 

Directed information theory is a recent extension of information theory, even if its roots go back to the 1960s and
1970s and the studies of Marko [57]. The developments began in the late 1990s, after the impetus given by James
Massey in 1990 [59]. The basic theory was then extended by Gerhard Kramer [45], and then further developed by
many authors ([86], [87], [91], [42], [70], to cite a few). We provide here a short review of the essentials of directed
information theory. We will, moreover, adopt a presentation close to the spirit of Granger causality to highlight
the links between Granger causality and information theory. We begin by recalling some basics from information
theory. Then, we describe the information-theoretic approach to study directional dependence between stochastic
processes, first in the bivariate case, and then, from section III-E, for networks, i.e., the multivariate case.

A. Notation and basics 

Let $H(x_A^n) = -E[\log p(x_A^n)]$ be the entropy of a random vector $x_A^n$, the density of which is p. Let the conditional
entropy be defined as $H(x_A^n \mid x_B^n) = -E[\log p(x_A^n \mid x_B^n)]$. The mutual information $I(x_A^n; x_B^n)$ between $x_A^n$ and $x_B^n$ is
defined as [13]:

$I(x_A^n; x_B^n) = H(x_B^n) - H(x_B^n \mid x_A^n)$

$= D_{KL}\big(p(x_A^n, x_B^n) \,\|\, p(x_A^n) p(x_B^n)\big)$ (3)

where $D_{KL}(p \| q) = E_p[\log p(x)/q(x)]$ is the Kullback-Leibler divergence. $D_{KL}(p \| q)$ is 0 if and only if p = q, and
it is positive otherwise. The mutual information effectively measures independence since it is 0 if and only if $x_A^n$
and $x_B^n$ are independent random vectors. As $I(x_A^n; x_B^n) = I(x_B^n; x_A^n)$, mutual information cannot handle directional
dependence.

Let $x_C^n$ be a third time series. It might be a multivariate process that accounts for side information (all of the
available observations, except $x_A^n$ and $x_B^n$). To account for $x_C^n$, the conditional mutual information is introduced:

$I(x_A^n; x_B^n \mid x_C^n) = E\big[D_{KL}\big(p(x_A^n, x_B^n \mid x_C^n) \,\|\, p(x_A^n \mid x_C^n) p(x_B^n \mid x_C^n)\big)\big]$ (4)

$= D_{KL}\big(p(x_A^n, x_B^n, x_C^n) \,\|\, p(x_A^n \mid x_C^n) p(x_B^n \mid x_C^n) p(x_C^n)\big)$ (5)

$I(x_A^n; x_B^n \mid x_C^n)$ is zero if and only if $x_A^n$ and $x_B^n$ are independent conditionally to $x_C^n$. Stated differently, conditional
mutual information measures the divergence between the actual observations and those which would be observed
under the Markov assumption ($x \to z \to y$). Arrows can be misleading here, as by reversibility of Markov chains,
the equality above holds also for ($y \to z \to x$). This emphasizes how mutual information cannot provide answers
to the information flow directivity problem.

B. Directional dependence between stochastic processes; causal conditioning 

The dependence between the components of the stochastic process $x_V$ is encoded in full generality by the joint
probability distributions $p(x_V^n)$. If V is partitioned into subsets A, B, C, studying dependencies between A and B
then requires that $p(x_V^n)$ is factorized into terms where $x_A$ and $x_B$ appear. For example, as $p(x_V^n) = p(x_A^n, x_B^n, x_C^n)$,
we can factorize the probability distribution as $p(x_B^n \mid x_A^n, x_C^n)\, p(x_A^n, x_C^n)$, which appears to emphasize a link from A
to B. Two problems appear, however: first, the presence of C perturbs the analysis (more than this, A and C have a
symmetrical role here); secondly, the factorization does not take into account the arrow of time, as the conditioning
is considered over the whole observations up to time n.

Marginalizing $x_C$ out makes it possible to work directly on $p(x_A^n, x_B^n)$. However, this eliminates all of the
dependence between A and B that might exist via C, and therefore this might lead to an incorrect assessment 
of the dependence. As for Granger causality, this means that dependence analysis is relative to the observation 
set. Restricting the study to A and B is what we referred to as the bivariate case, and this allows the basic ideas 
to be studied. We will therefore present directed information first in the bivariate case, and then turn to the full 
multivariate case. 

The second problem is at the root of the measure of directional dependence between stochastic processes.
Assuming that $x_A(n)$ and $x_B(n)$ are linked by some physical (e.g., biological, economical) system, it is natural
to postulate that their dependence is constrained by causality: if $A \to B$, then an event occurring at some time in
A will influence B later on. Let us come back to the simple factorization above for the bivariate case. We have
$p(x_A^n, x_B^n) = p(x_B^n \mid x_A^n)\, p(x_A^n)$, and furthermore (see footnote 1):

$p(x_B^n \mid x_A^n) = \prod_{i=1}^{n} p(x_B(i) \mid x_B^{i-1}, x_A^n)$ (6)

where for i = 1, the first term is $p(x_B(1) \mid x_A^n)$. The conditional distribution quantifies a directional dependence
from A to B, but it lacks the causality property mentioned above, as $p(x_B(i) \mid x_B^{i-1}, x_A^n)$ quantifies the influence of
the whole observation $x_A^n$ (past and future of i) on the present $x_B(i)$ knowing its past $x_B^{i-1}$. The causality principle
would require restricting the conditioning at time i to the past of A only. Kramer defined 'causal conditioning' precisely
in this sense [45]. Modifying Eq. (6) accordingly, we end up with the definition of the causal conditional probability
distribution:

$p(x_B^n \| x_A^n) := \prod_{i=1}^{n} p(x_B(i) \mid x_B^{i-1}, x_A^{i})$ (7)

Footnote 1: We implicitly choose 1 here as the initial time.

Remarkably, this provides an alternative factorization of the joint probability. As noted by Massey [59], $p(x_A^n, x_B^n)$
can then be factorized as (see footnote 2):

$p(x_A^n, x_B^n) = p(x_B^n \| x_A^n)\, p(x_A^n \| x_B^{n-1})$ (8)

Assuming that $x_A$ is the input of a system that creates $x_B$, $p(x_A^n \| x_B^{n-1}) = \prod_i p(x_A(i) \mid x_A^{i-1}, x_B^{i-1})$ characterizes the
feedback in the system: each of the factors controls the probability of the input $x_A$ at time i conditionally to its
past and to the past values of the output $x_B$. Likewise, the term $p(x_B^n \| x_A^n) = \prod_i p(x_B(i) \mid x_B^{i-1}, x_A^{i})$ characterizes
the direct (or feedforward) link in the system.
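
The factorization of Eq. (8) can be verified numerically on a small example. The sketch below is an illustration under stated assumptions, not part of the original development: it draws an arbitrary joint distribution over two binary sequences of length n = 3 (the random weights and the helper names `marginal`, `causal_b_given_a` and `causal_a_given_past_b` are hypothetical), builds the two causally conditioned distributions from their defining products, and checks that their product recovers the joint probability.

```python
import itertools
import random

random.seed(0)
n = 3
# Outcome layout: (a1, ..., an, b1, ..., bn), all binary.
outcomes = list(itertools.product((0, 1), repeat=2 * n))
weights = [random.random() for _ in outcomes]
total = sum(weights)
pmf = {o: w / total for o, w in zip(outcomes, weights)}

def marginal(a_len, b_len, a, b):
    """P(first a_len samples of x_A equal a[:a_len], first b_len samples of x_B equal b[:b_len])."""
    return sum(p for o, p in pmf.items()
               if o[:a_len] == a[:a_len] and o[n:n + b_len] == b[:b_len])

def causal_b_given_a(a, b):
    """p(x_B^n || x_A^n): product over i of p(x_B(i) | x_B^{i-1}, x_A^i)."""
    out = 1.0
    for i in range(1, n + 1):
        out *= marginal(i, i, a, b) / marginal(i, i - 1, a, b)
    return out

def causal_a_given_past_b(a, b):
    """p(x_A^n || x_B^{n-1}): product over i of p(x_A(i) | x_A^{i-1}, x_B^{i-1})."""
    out = 1.0
    for i in range(1, n + 1):
        out *= marginal(i, i - 1, a, b) / marginal(i - 1, i - 1, a, b)
    return out

# Largest deviation between the joint pmf and the product of the two causal factors.
max_gap = max(
    abs(pmf[o] - causal_b_given_a(o[:n], o[n:]) * causal_a_given_past_b(o[:n], o[n:]))
    for o in outcomes
)
print(max_gap)
```

The deviation is at the level of floating-point rounding, since the factors telescope exactly; the point of the sketch is only to make the roles of the feedforward and feedback terms concrete.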
Several interesting simple cases occur: 

• In the absence of feedback in the link from A to B, there is the following:

$p(x_A(i) \mid x_A^{i-1}, x_B^{i-1}) = p(x_A(i) \mid x_A^{i-1})$, $\forall i \ge 2$ (9)

or equivalently, in terms of entropies,

$H(x_A(i) \mid x_A^{i-1}, x_B^{i-1}) = H(x_A(i) \mid x_A^{i-1})$, $\forall i \ge 2$ (10)

and as a consequence:

$p(x_A^n \| x_B^{n-1}) = p(x_A^n)$ (11)

• Likewise, if there is only a feedback term, then $p(x_B(i) \mid x_B^{i-1}, x_A^{i}) = p(x_B(i) \mid x_B^{i-1})$ and then:

$p(x_B^n \| x_A^n) = p(x_B^n)$ (12)

• If the link is memoryless, i.e., the output $x_B$ does not depend on the past, then:

$p(x_B(i) \mid x_A^{i}, x_B^{i-1}) = p(x_B(i) \mid x_A(i))$, $\forall i \ge 1$ (13)

These results allow the question of whether $x_A$ influences $x_B$ to be addressed. If it does, then the joint distribution
has the factorization of Eq. (8). However, if $x_A$ does not influence $x_B$, then $p(x_B^n \| x_A^n) = p(x_B^n)$, and the factorization
of the joint probability distribution simplifies to $p(x_A^n \| x_B^{n-1})\, p(x_B^n)$. The Kullback divergence between the probability
distributions for each case generalizes the definition of mutual information to the directional mutual information:

$I(x_A^n \to x_B^n) = D_{KL}\big(p(x_A^n, x_B^n) \,\|\, p(x_A^n \| x_B^{n-1})\, p(x_B^n)\big)$ (14)

This quantity measures the loss of information when it is incorrectly assumed that $x_A$ does not influence $x_B$. This
was called directed information by Massey [59]. Expanding the Kullback divergence allows different forms for the
directed information to be obtained:

$I(x_A^n \to x_B^n) = \sum_{i=1}^{n} I(x_A^{i}; x_B(i) \mid x_B^{i-1})$ (15)

$= H(x_B^n) - H(x_B^n \| x_A^n)$ (16)

where we define the 'causal conditional entropy':

$H(x_B^n \| x_A^n) = -E[\log p(x_B^n \| x_A^n)]$ (17)

$= \sum_{i=1}^{n} H(x_B(i) \mid x_B^{i-1}, x_A^{i})$ (18)

Footnote 2: $x_B^{n-1}$ stands for the delayed collection of samples of $x_B$. If the time origin is finite, 0 or 1, the first element of the list $x_B^{n-1}$ should be
understood as a wild card which does not influence the conditioning.

Note that causal conditioning might involve more than one process. This leads to defining the causal
conditional directed information as:

$I(x_A^n \to x_B^n \| x_C^n) := H(x_B^n \| x_C^n) - H(x_B^n \| x_A^n, x_C^n)$

$= \sum_{i=1}^{n} I(x_A^{i}; x_B(i) \mid x_B^{i-1}, x_C^{i})$ (19)

The basic properties of the directed information were studied by Massey and Kramer [59], [60], [45], and some
are recalled below. As a Kullback divergence, the directed information is always positive or zero. Then, simple
algebraic manipulation allows the following decomposition to be obtained:

$I(x_A^n \to x_B^n) + I(x_B^{n-1} \to x_A^n) = I(x_A^n; x_B^n)$ (20)

Eq. (20) is fundamental, as it shows how mutual information splits into the sum of a feedforward information flow
$I(x_A^n \to x_B^n)$ and a feedback information flow $I(x_B^{n-1} \to x_A^n)$. In the absence of feedback, $p(x_A^n \| x_B^{n-1}) = p(x_A^n)$
and $I(x_A^n; x_B^n) = I(x_A^n \to x_B^n)$. Eq. (20) allows the conclusion that the mutual information is always greater than or equal to
the directed information, as $I(x_B^{n-1} \to x_A^n)$ is always positive or zero (as a directed information). It is zero if and
only if:

$I(x_A(i); x_B^{i-1} \mid x_A^{i-1}) = 0$, $\forall i = 2, \ldots, n$ (21)

or equivalently:

$H(x_A(i) \mid x_A^{i-1}, x_B^{i-1}) = H(x_A(i) \mid x_A^{i-1})$, $\forall i = 2, \ldots, n$ (22)

This situation corresponds to the absence of feedback in the link A -> B, whence the fundamental result that the 
directed information and the mutual information are equal if the channel is free of feedback. This result implies 
that mutual information over-estimates the directed information between two processes in the presence of feedback. 
This was thoroughly studied in [13], [86], [91], [87], in a communication-theoretic framework.
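
The over-estimation by mutual information in the presence of feedback can be illustrated on a toy channel. The sketch below is not taken from the cited references; it assumes a hypothetical two-step binary system in which the output is the input flipped with probability 0.1 and the second input simply copies the first output (pure feedback). The variable names and the helper functions are assumptions introduced for the illustration. It evaluates the three terms of Eq. (20) from the joint distribution.

```python
import itertools
import math

n, eps = 2, 0.1
pmf = {}  # outcome layout: (a1, a2, b1, b2)
for a1, n1, n2 in itertools.product((0, 1), repeat=3):
    b1 = a1 ^ n1   # noisy feedforward link
    a2 = b1        # feedback: the next input copies the last output
    b2 = a2 ^ n2   # noisy feedforward link again
    p = 0.5 * (eps if n1 else 1 - eps) * (eps if n2 else 1 - eps)
    pmf[(a1, a2, b1, b2)] = pmf.get((a1, a2, b1, b2), 0.0) + p

def entropy(idx):
    """Entropy (bits) of the marginal of pmf on coordinates idx."""
    marg = {}
    for o, p in pmf.items():
        key = tuple(o[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def cmi(x, y, z=()):
    """Conditional mutual information I(x; y | z) on coordinate lists."""
    x, y, z = list(x), list(y), list(z)
    return entropy(x + z) + entropy(y + z) - entropy(x + y + z) - entropy(z)

def past_a(i):
    return list(range(i))            # coordinates of x_A^i

def past_b(i):
    return list(range(n, n + i))     # coordinates of x_B^i

steps = range(1, n + 1)
# Feedforward directed information: sum_i I(x_A^i; x_B(i) | x_B^{i-1}).
di_ab = sum(cmi(past_a(i), [n + i - 1], past_b(i - 1)) for i in steps)
# Feedback term: sum_i I(x_B^{i-1}; x_A(i) | x_A^{i-1}).
di_feedback = sum(cmi(past_b(i - 1), [i - 1], past_a(i - 1)) for i in steps)
mi = cmi(past_a(n), past_b(n))
h_eps = -(eps * math.log2(eps) + (1 - eps) * math.log2(1 - eps))
print(di_ab, di_feedback, mi)
```

In this assumed system the mutual information is one bit, whereas the directed information from A to B is only 1 - h(0.1) bits, where h is the binary entropy function; the difference is exactly the feedback term.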

The decomposition of Eq. (20) is surprising, as it shows that the mutual information is not the sum of the directed
information flowing in both directions. Instead, the following decomposition holds:

$I(x_A^n \to x_B^n) + I(x_B^n \to x_A^n) = I(x_A^n; x_B^n) + I(x_A^n \to x_B^n \| x_A^{n-1})$ (23)

where:

$I(x_A^n \to x_B^n \| x_A^{n-1}) = \sum_i I(x_A^{i}; x_B(i) \mid x_B^{i-1}, x_A^{i-1})$

$= \sum_i I(x_A(i); x_B(i) \mid x_B^{i-1}, x_A^{i-1})$ (24)

This demonstrates that $I(x_A^n \to x_B^n) + I(x_B^n \to x_A^n)$ is symmetrical, but is in general not equal to the mutual
information, except if and only if $I(x_A(i); x_B(i) \mid x_B^{i-1}, x_A^{i-1}) = 0$, $\forall i = 1, \ldots, n$. As the term in the sum is the
mutual information between the present samples of the two processes conditioned on their joint past values, this
measure is a measure of instantaneous dependence. It is indeed symmetrical in A and B. The term
$I(x_A^n \to x_B^n \| x_A^{n-1}) = I(x_B^n \to x_A^n \| x_B^{n-1})$ will thus be named the instantaneous information exchange between $x_A$ and $x_B$,
and will hereafter be denoted as $I(x_A^n \leftrightarrow x_B^n)$. Like directed information, conditional forms of the instantaneous
information exchange can be defined, as for example:

$I(x_A^n \leftrightarrow x_B^n \| x_C^n) := I(x_A^n \to x_B^n \| x_A^{n-1}, x_C^n)$ (25)

which quantifies an instantaneous information exchange between A and B causally conditionally to C. 

C. Directed information rates 

Entropy and mutual information in general increase linearly with the length n of the recorded time series.
Shannon's information rate for stochastic processes compensates for the linear growth by considering
$A_\infty(x) = \lim_{n \to +\infty} A(x^n)/n$ (if the limit exists), where $A(x^n)$ denotes any information measure on the sample $x^n$ of length
n.

For the important class of stationary processes (see e.g., [13]), the entropy rate turns out to be the limit of the
conditional entropy:

$\lim_{n \to +\infty} \frac{1}{n} H(x_A^n) = \lim_{n \to +\infty} H(x_A(n) \mid x_A^{n-1})$ (26)

Kramer generalized this result for causal conditional entropies [45], thus defining the directed information rate for
stationary processes as:

$I_\infty(x_A \to x_B) = \lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} I(x_A^{i}; x_B(i) \mid x_B^{i-1})$

$= \lim_{n \to +\infty} I(x_A^n; x_B(n) \mid x_B^{n-1})$ (27)

This result holds also for the instantaneous information exchange rate. Note that the proof of the result relies on
the positivity of the entropy for discrete valued stochastic processes. For continuously valued processes, for which
the entropy can be negative, the proof is more involved and requires the methods developed in [71], [33], [34];
see also [87].
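
The equality of Eq. (26) between the per-sample entropy limit and the limiting conditional entropy can be observed on a simple stationary process. The sketch below assumes a two-state stationary Markov chain with an arbitrarily chosen transition matrix (the matrix `P`, its stationary law `pi` and the function names are assumptions made for the illustration) and computes block entropies by exhaustive enumeration.

```python
import itertools
import math

# Assumed transition matrix and its stationary distribution (pi P = pi).
P = [[0.9, 0.1], [0.3, 0.7]]
pi = [0.75, 0.25]

def block_entropy(n):
    """H(x^n) in bits for the stationary chain, by enumerating all 2^n sequences."""
    h = 0.0
    for seq in itertools.product((0, 1), repeat=n):
        p = pi[seq[0]]
        for s, t in zip(seq, seq[1:]):
            p *= P[s][t]
        h -= p * math.log2(p)
    return h

def row_entropy(row):
    """Entropy (bits) of one row of the transition matrix."""
    return -sum(q * math.log2(q) for q in row if q > 0)

# Closed-form entropy rate of a stationary Markov chain: sum_s pi(s) H(P[s]).
rate = sum(pi[s] * row_entropy(P[s]) for s in (0, 1))
# Per-sample block entropies H(x^n)/n for growing n.
per_sample = [block_entropy(n) / n for n in (1, 4, 12)]
# Conditional entropy H(x(n) | x^{n-1}) for n = 12, as a difference of block entropies.
cond_entropy = block_entropy(12) - block_entropy(11)
print(per_sample, cond_entropy, rate)
```

For this chain the conditional entropy already equals the rate for any n of at least 2, while the per-sample block entropy approaches it more slowly from above, which is the behaviour that motivates working with conditional forms in Eq. (27).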

D. Transfer entropy and instantaneous information exchange 

As introduced by Schreiber in [81], [37], transfer entropy evaluates the deviation of the observed data from a
model, assuming the following joint Markov property:

$p(x_B(n) \mid x_{B,n-k+1}^{n-1}, x_{A,n-l+1}^{n-1}) = p(x_B(n) \mid x_{B,n-k+1}^{n-1})$ (28)

where $x_{B,n-k+1}^{n-1}$ denotes the samples of $x_B$ from time n-k+1 to time n-1, and likewise for $x_A$.
This leads to the following definition:

$T(x_{A,n-l+1}^{n-1} \to x_{B,n-k+1}^{n}) = E\left[\log \frac{p(x_B(n) \mid x_{B,n-k+1}^{n-1}, x_{A,n-l+1}^{n-1})}{p(x_B(n) \mid x_{B,n-k+1}^{n-1})}\right]$ (29)

Then $T(x_{A,n-l+1}^{n-1} \to x_{B,n-k+1}^{n}) = 0$ if and only if Eq. (28) is satisfied. Although in the original definition, the past
of x in the conditioning might begin at a different time $m \neq n$, for practical reasons m = n is considered. Actually,
no a priori information is available about possible delays, and setting m = n allows the transfer entropy to be
compared with the directed information.

By expressing the transfer entropy as a difference of conditional entropies, we get:

$T(x_{A,n-l+1}^{n-1} \to x_{B,n-k+1}^{n}) = H(x_B(n) \mid x_{B,n-k+1}^{n-1}) - H(x_B(n) \mid x_{B,n-k+1}^{n-1}, x_{A,n-l+1}^{n-1})$

$= I(x_{A,n-l+1}^{n-1}; x_B(n) \mid x_{B,n-k+1}^{n-1})$ (30)

For l = n = k and choosing 1 as the time origin, the identity $I(x, y; z \mid w) = I(x; z \mid w) + I(y; z \mid x, w)$ leads to:

$I(x_A^n; x_B(n) \mid x_B^{n-1}) = I(x_A^{n-1}; x_B(n) \mid x_B^{n-1}) + I(x_A(n); x_B(n) \mid x_A^{n-1}, x_B^{n-1})$

$= T(x_A^{n-1} \to x_B^n) + I(x_A(n); x_B(n) \mid x_A^{n-1}, x_B^{n-1})$ (31)

For stationary processes, letting $n \to \infty$ and provided the limits exist, for the rates, we obtain:

$I_\infty(x_A \to x_B) = T_\infty(x_A \to x_B) + I_\infty(x_A \leftrightarrow x_B)$ (32)

Transfer entropy is the part of the directed information that measures the influence of the past of $x_A$ on the present
of $x_B$. However, it does not take into account the possible instantaneous dependence of one time series on another,
which is handled by directed information.

Moreover, as defined by Schreiber in [81], [37], only $I(x_A^{i-1}; x_B(i) \mid x_B^{i-1})$ is considered in T, instead of its sum
over i in the directed information. Thus stationarity is implicitly assumed and the transfer entropy has the same
meaning as a rate. A sum over delays was considered by Palus as a means of reducing errors when estimating the
measure [66]. Summing over n in Eq. (31), the following decomposition of the directed information is obtained:

$I(x_A^n \to x_B^n) = I(x_A^{n-1} \to x_B^n) + I(x_A^n \leftrightarrow x_B^n)$ (33)

Eq. (33) establishes that the influence of one process on another can be decomposed into two terms that account for
the past and for the instantaneous contributions. Moreover, this explains the presence of the term $I(x_A^n \leftrightarrow x_B^n)$ in the
r.h.s. of Eq. (23): instantaneous information exchange is counted twice in the l.h.s. terms $I(x_A^n \to x_B^n) + I(x_B^n \to x_A^n)$,
but only once in the mutual information $I(x_A^n; x_B^n)$. This allows Eq. (23) to be written in a slightly different form,
as:

$I(x_A^{n-1} \to x_B^n) + I(x_B^{n-1} \to x_A^n) + I(x_A^n \leftrightarrow x_B^n) = I(x_A^n; x_B^n)$ (34)

which is very appealing, as it shows how dependence as measured by mutual information decomposes as the sum
of the measures of directional dependences and the measure of instantaneous coupling.
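
The decompositions of Eqs. (33) and (34) hold for any joint distribution, which can be checked by brute force. The sketch below is a numerical illustration only: the joint distribution over two binary sequences of length n = 3 is drawn at random, and all names (`te_ab`, `te_ba`, `inst`, the helper functions) are assumptions introduced here. Each measure is computed from its defining sum of conditional mutual informations.

```python
import itertools
import math
import random

random.seed(1)
n = 3
# Outcome layout: (a1, ..., an, b1, ..., bn), all binary, with arbitrary random weights.
outcomes = list(itertools.product((0, 1), repeat=2 * n))
weights = [random.random() for _ in outcomes]
total = sum(weights)
pmf = {o: w / total for o, w in zip(outcomes, weights)}

def entropy(idx):
    """Entropy (bits) of the marginal of pmf on coordinates idx."""
    marg = {}
    for o, p in pmf.items():
        key = tuple(o[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def cmi(x, y, z=()):
    """Conditional mutual information I(x; y | z) on coordinate lists."""
    x, y, z = list(x), list(y), list(z)
    return entropy(x + z) + entropy(y + z) - entropy(x + y + z) - entropy(z)

def past_a(i):
    return list(range(i))            # coordinates of x_A^i

def past_b(i):
    return list(range(n, n + i))     # coordinates of x_B^i

steps = range(1, n + 1)
# Transfer entropies: past of one process to the present of the other.
te_ab = sum(cmi(past_a(i - 1), [n + i - 1], past_b(i - 1)) for i in steps)
te_ba = sum(cmi(past_b(i - 1), [i - 1], past_a(i - 1)) for i in steps)
# Instantaneous information exchange: present samples given the joint past.
inst = sum(cmi([i - 1], [n + i - 1], past_a(i - 1) + past_b(i - 1)) for i in steps)
# Directed information and mutual information.
di_ab = sum(cmi(past_a(i), [n + i - 1], past_b(i - 1)) for i in steps)
mi = cmi(past_a(n), past_b(n))
print(di_ab, te_ab + inst, te_ab + te_ba + inst, mi)
```

The second printed value reproduces the first (Eq. (33)) and the third reproduces the fourth (Eq. (34)), up to rounding.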

E. Accounting for side information 

The preceding developments aimed at proposing definitions of the information flow between $x_A$ and $x_B$;
however, whenever A and B are connected to other parts of the network, the flow of information between A and
B might be mediated by other members of the network. Time series observed on nodes other than A and B are
hereafter referred to as side information. The available side information at time n is denoted as $x_C^n$, with A, B, C
forming a partition of V. Then, depending on the type of conditioning (usual or causal), two approaches are possible.
Usual conditioning considers directed information from A to B that is conditioned on the whole observation $x_C^n$.
However, this leads to the consideration of causal flows from A to B that possibly include a flow that goes from A
to B via C in the future! Thus, an alternate definition for conditioning is required. This is given by the definition
of Eq. (19) of the causal conditional directed information:

$I(x_A^n \to x_B^n \| x_C^n) := H(x_B^n \| x_C^n) - H(x_B^n \| x_A^n, x_C^n)$

$= \sum_{i=1}^{n} I(x_A^{i}; x_B(i) \mid x_B^{i-1}, x_C^{i})$ (35)

Does the causal conditional directed information decompose as the sum of a causal conditional transfer entropy
and a causal conditional instantaneous information exchange, as it does in the bivariate case? Applying twice the
chain rule for conditional mutual information, we obtain:

$I(x_A^n \to x_B^n \| x_C^n) = I(x_A^{n-1} \to x_B^n \| x_C^{n-1}) + I(x_A^n \leftrightarrow x_B^n \| x_C^n) + \Delta I(x_C^n \leftrightarrow x_B^n)$ (36)

In this equation, $I(x_A^{n-1} \to x_B^n \| x_C^{n-1})$ is termed the 'causal conditional transfer entropy'. This measures the flow of
information from A to B by taking into account a possible route via C. If the flow of information from A to B is
entirely relayed by C, the 'causal conditional transfer entropy' is zero. In this situation, the usual transfer entropy
is not zero, indicating the existence of a flow from A to B. Conditioning on C allows the examination of whether
the route goes through C. The term:

$I(x_A^n \leftrightarrow x_B^n \| x_C^n) := I(x_A^n \to x_B^n \| x_A^{n-1}, x_C^n)$ (37)

$= \sum_{i=1}^{n} I(x_A(i); x_B(i) \mid x_B^{i-1}, x_A^{i-1}, x_C^{i})$ (38)

is the 'causal conditional information exchange'. This measures the conditional instantaneous coupling between A
and B. The term $\Delta I(x_C^n \leftrightarrow x_B^n)$ emphasizes the difference between the bivariate and the multivariate cases. This
extra term measures an instantaneous coupling and is defined by:

$\Delta I(x_C^n \leftrightarrow x_B^n) = I(x_C^n \leftrightarrow x_B^n \| x_A^{n-1}) - I(x_C^n \leftrightarrow x_B^n)$ (39)

An alternate decomposition to Eq. (36) is:

$I(x_A^n \to x_B^n \| x_C^n) = I(x_A^{n-1} \to x_B^n \| x_C^n) + I(x_A^n \leftrightarrow x_B^n \| x_C^n)$ (40)

which emphasizes that the extra term comes from:

$I(x_A^{n-1} \to x_B^n \| x_C^n) = I(x_A^{n-1} \to x_B^n \| x_C^{n-1}) + \Delta I(x_C^n \leftrightarrow x_B^n)$ (41)

This demonstrates that the definition of the conditional transfer entropy requires conditioning on the past of C. If
not, the extra term appears and accounts for instantaneous information exchanges between C and B, due to the
addition of the term $x_C(i)$ in the conditioning. This extra term highlights the difference between the two different
natures of instantaneous coupling.

November 15, 2012 DRAFT

The first term,

$I(x_C^n \leftrightarrow x_B^n \| x_A^{n-1}) = \sum_i I(x_C(i); x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1})$ (42)

describes the intrinsic coupling in the sense that it does not depend on parties other than C and B. The second
coupling term,

$I(x_C^n \leftrightarrow x_B^n) = \sum_i I(x_C(i); x_B(i) \mid x_B^{i-1}, x_C^{i-1})$

is relative to the extrinsic coupling, as it measures the instantaneous coupling at time i that is created by variables
other than B and C.

As discussed in section II-C, the second definition for instantaneous coupling considers conditioning on the past
of the side information only. Causally conditioning on $x_C^{n-1}$ does not modify the results of the bivariate case. In
particular, we still get the elegant decomposition:

$I(x_A^n \to x_B^n \| x_C^{n-1}) = I(x_A^{n-1} \to x_B^n \| x_C^{n-1}) + I(x_A^n \leftrightarrow x_B^n \| x_C^{n-1})$ (43)

and therefore, the decomposition of Eq. (34) is generalized to:

$I(x_A^{n-1} \to x_B^n \| x_C^{n-1}) + I(x_B^{n-1} \to x_A^n \| x_C^{n-1}) + I(x_A^n \leftrightarrow x_B^n \| x_C^{n-1}) = I(x_A^n; x_B^n \| x_C^{n-1})$ (44)

where:

$I(x_A^n; x_B^n \| x_C^{n-1}) = \sum_i I(x_A^n; x_B(i) \mid x_B^{i-1}, x_C^{i-1})$ (45)

is the causally conditioned mutual information.

Finally, let us consider that for jointly stationary time series, the causal directed information rate is defined
similarly to the bivariate case, as:

$I_\infty(x_A \to x_B \| x_C) = \lim_{n \to +\infty} \frac{1}{n} \sum_{i=1}^{n} I(x_A^{i}; x_B(i) \mid x_B^{i-1}, x_C^{i})$ (46)

$= \lim_{n \to +\infty} I(x_A^n; x_B(n) \mid x_B^{n-1}, x_C^n)$ (47)

In this section we have emphasized Kramer's causal conditioning, both for the definition of directed information
and for taking into account side information. We have also shown that Schreiber's transfer entropy is that part of the
directed information that is dedicated to the strict sense of causal information flow (not accounting for simultaneous
coupling). The next section more explicitly revisits the links between Granger causality and directed information
theory.
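
The relay situation mentioned after Eq. (36), in which the flow from A to B is entirely mediated by C, can be made concrete with a deterministic toy network. The sketch below is an assumed example, not one taken from the text: over three time steps, C copies A with one step of delay and B copies C with one step of delay, all free bits being fair and independent. It compares the usual transfer entropy from A to B with the transfer entropy causally conditioned on the past of C.

```python
import itertools
import math

n = 3
# Outcome layout: (a1, a2, a3, b1, b2, b3, c1, c2, c3).
pmf = {}
for a1, a2, a3, b1, c1 in itertools.product((0, 1), repeat=5):
    c2, c3 = a1, a2   # C relays A with one step of delay
    b2, b3 = c1, c2   # B relays C with one step of delay
    pmf[(a1, a2, a3, b1, b2, b3, c1, c2, c3)] = 1 / 32

def entropy(idx):
    """Entropy (bits) of the marginal of pmf on coordinates idx."""
    marg = {}
    for o, p in pmf.items():
        key = tuple(o[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def cmi(x, y, z=()):
    """Conditional mutual information I(x; y | z) on coordinate lists."""
    x, y, z = list(x), list(y), list(z)
    return entropy(x + z) + entropy(y + z) - entropy(x + y + z) - entropy(z)

def past(offset, i):
    """Coordinates of the first i samples of the process stored at `offset`."""
    return list(range(offset, offset + i))

A, B, C = 0, n, 2 * n
steps = range(1, n + 1)
# Usual transfer entropy: sum_i I(x_A^{i-1}; x_B(i) | x_B^{i-1}).
te = sum(cmi(past(A, i - 1), [B + i - 1], past(B, i - 1)) for i in steps)
# Causal conditional transfer entropy: the past of C is added to the conditioning.
cte = sum(cmi(past(A, i - 1), [B + i - 1], past(B, i - 1) + past(C, i - 1))
          for i in steps)
print(te, cte)
```

In this assumed network the usual transfer entropy is one bit (B at time 3 reproduces A at time 1), whereas the causally conditioned version vanishes, because the past of C already carries everything that A contributes to B.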

IV. Inferring Granger causality and instantaneous coupling 

Granger causality in its probabilistic form is not operational. In practical situations, for assessing Granger 
causality between time series, we cannot use the definition directly. We have to define dedicated tools to assess 
the conditional independence. We use this inference framework to show the links between information theory and
Granger causality. We begin by re-expressing Granger causality definitions in terms of some measures that arise
from directed information theory. Therefore, in an inference problem, these measures can be used as tools for 
inference. However, we show in the following sections that these measures naturally emerge from the more usual 
statistical inference strategies. In the following, and as above, we use the same partitioning of V into the union of 
disjoint subsets of A, B and C. 

A. Information-theoretic measures and Granger causality 

As anticipated in the presentation of directed information, there are profound links between Granger causality 
and directed information measures. Granger causality relies on conditional independence, and it can also be defined 
using measures of conditional independence. Information-theoretic measures appear as natural candidates. Recall 
that two random elements are independent if and only if their mutual information is zero. Moreover, two random 
elements are independent conditionally to a third one if and only if the conditional mutual information is zero. We 
can reconsider definitions 2, 3 and 4 and recast them in terms of information-theoretic measures.

Definition 2 stated that $x_A$ does not Granger cause $x_B$ relative to V if and only if $x_B(n+1) \perp\!\!\!\perp x_A^n \mid x_B^n, x_C^n$, $\forall n \ge 1$.
This can be alternatively rephrased into:

Definition 5. $x_A$ does not Granger cause $x_B$ relative to V if and only if $I(x_A^{n-1} \to x_B^n \| x_C^{n-1}) = 0$, $\forall n \ge 1$,

since $x_B(i) \perp\!\!\!\perp x_A^{i-1} \mid x_B^{i-1}, x_C^{i-1}$, $\forall\, 1 \le i \le n$, is equivalent to $I(x_B(i); x_A^{i-1} \mid x_B^{i-1}, x_C^{i-1}) = 0$, $\forall\, 1 \le i \le n$.

Otherwise stated, the transfer entropy from A to B causally conditioned on C is zero if and only if A does not 
Granger cause B relative to V. This shows that causal conditional transfer entropy can be used to assess Granger 
causality. 

Likewise, we can give alternative definitions of instantaneous coupling. 

Definition 6. $x_A$ and $x_B$ are not conditionally instantaneously coupled relative to V if and only if
$I(x_A^n \leftrightarrow x_B^n \| x_C^n) = 0$, $\forall n \ge 1$,

or if and only if the instantaneous information exchange causally conditioned on C is zero. The second possible
definition of instantaneous coupling is equivalent to:

Definition 7. $x_A$ and $x_B$ are not instantaneously coupled relative to V if and only if $I(x_A^n \leftrightarrow x_B^n \| x_C^{n-1}) = 0$, $\forall n \ge 1$,

or if and only if the instantaneous information exchange causally conditioned on the past of C is zero.

Note that in the bivariate case only (when C is not taken into account), the directed information $I(x_A^n \to x_B^n)$
summarizes both the Granger causality and the coupling, as it decomposes as the sum of the transfer entropy
$I(x_A^{n-1} \to x_B^n)$ and the instantaneous information exchange $I(x_A^n \leftrightarrow x_B^n)$.

B. Granger causality inference 

We consider the practical problem of inferring the graph of dependence between the components of a multivariate
process. Let us assume that we have measured a multivariate process $x_V(n)$ for $n \le T$. We want to study the
dependence between each pair of components (Granger causality and instantaneous coupling between any pair of
components relative to V).

We can use the result of the preceding section to evaluate the directed information measures on the data. When 
studying the influence from any subset A to any subset B, if the measures are zero, then there is no causality (or 
no coupling); if they are strictly positive, then A Granger causes B relative to V (or A and B are coupled relative 
to V). This point of view has been adopted in many of the studies that we have already referred to (e.g., [37], [36],
[67], [74], [92]), and it relies on estimating the measures from the data. We will not review the estimation problem
here.

However, it is interesting to examine more traditional frameworks for testing Granger causality, and to examine 
how directed information theory naturally emerges from these frameworks. To begin with, we show how the measures 
defined emerge from a binary hypothesis-testing view of Granger causality inference. We then turn to prediction 
and model-based approaches. We will review how Geweke's measures of Granger causality in the Gaussian case 
are equivalent to directed information measures. We will then present a more general case adopted by [26], [77],
[43], [44], [74] and based on a model of the data.

1) Directed information emerges from a hypothesis-testing framework: In the inference problem, we want to
determine whether or not $x_A$ Granger causes (is coupled with) $x_B$ relative to V. This can be formulated as
a binary hypothesis testing problem. For inferring dependencies between A and B relative to V, we can state the
problem as follows.

Assume we observe $x_V(n)$, $\forall n \le T$. Then, we want to test '$x_A$ does not Granger cause $x_B$' against '$x_A$ causes
$x_B$', and '$x_A$ and $x_B$ are instantaneously coupled' against '$x_A$ and $x_B$ are not instantaneously coupled'. We will refer
to the first test as the Granger causality test, and to the second one as the instantaneous coupling test.

In the bivariate case, the Granger causality test reads:

$H_0 : p_0(x_B(i) \mid x_A^{i-1}, x_B^{i-1}) = p(x_B(i) \mid x_B^{i-1})$, $\forall i \le T$

$H_1 : p_1(x_B(i) \mid x_A^{i-1}, x_B^{i-1}) = p(x_B(i) \mid x_A^{i-1}, x_B^{i-1})$, $\forall i \le T$ (48)

This leads to testing different functional forms of the conditional densities of $x_B(i)$ given the past of $x_A$.

The likelihood of the observation under $H_1$ is the full joint probability $p(x_A^T, x_B^T) = p(x_A^T \| x_B^T)\, p(x_B^T \| x_A^{T-1})$. Under
$H_0$ we have $p(x_B^T \| x_A^{T-1}) = p(x_B^T)$, and the likelihood reduces to $p(x_A^T \| x_B^T)\, p(x_B^T \| x_A^{T-1}) = p(x_A^T \| x_B^T)\, p(x_B^T)$. The
log likelihood ratio for the test is:

$l(x_A^T, x_B^T) := \log \frac{p(x_A^T, x_B^T \mid H_1)}{p(x_A^T, x_B^T \mid H_0)} = \log \frac{p(x_B^T \| x_A^{T-1})}{p(x_B^T)}$ (49)

$= \sum_{i=1}^{T} \log \frac{p(x_B(i) \mid x_A^{i-1}, x_B^{i-1})}{p(x_B(i) \mid x_B^{i-1})}$ (50)

For example, in the case where the multivariate process is a positive Harris recurrent Markov chain [61 1, the law 
of large numbers applies and we have under hypothesis H\. 

— 1(xa,Xb) t ~ >+ °°> T^xa -» xb) a.s. (51) 
where $T_\infty(x_A \to x_B)$ is the transfer entropy rate. Thus from a practical point of view, as the amount of data 
increases, we expect the normalized log likelihood ratio to be close to the transfer entropy rate (under $H_1$). Turning 
the point of view around, this can justify the use of an estimated transfer entropy to assess Granger causality. Under 
$H_0$, $\frac{1}{T}\, l(x_A^T, x_B^T)$ converges to 
$-\lim_{T \to +\infty} (1/T)\, D_{KL}\big(p(x_A^T \| x_B^T)\, p(x_B^T) \,\big\|\, p(x_A^T \| x_B^T)\, p(x_B^T \| x_A^{T-1})\big)$, 
where the limit can be termed 'the Lautum transfer entropy rate' that extends the 'Lautum directed information' 
defined in [70]. Directed information can be viewed as a measure of the loss of information when assuming $x_A$ does 
not causally influence $x_B$ when it actually does. Likewise, 'Lautum directed information' measures the loss of 
information when assuming $x_A$ does causally influence $x_B$, when actually it does not. 
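
As a minimal numerical sketch of this convergence, assume a bivariate linear Gaussian process in which $x_A$ drives 
$x_B$ with a one-sample delay. The model, its coefficients, the number of lags `order` and the helper names 
(`simulate`, `lagged`, `gaussian_loglik`, `normalized_llr`) are illustrative assumptions, not part of the original 
analysis. Both hypotheses are fitted by least squares, and the normalized log likelihood ratio computed from the fitted 
Gaussian conditional densities coincides with half the log ratio of the residual variances, which is the usual Gaussian 
plug-in estimate of the transfer entropy rate (in nats). 

```python
import numpy as np

def simulate(T, c=0.5, seed=0):
    """Toy model (assumed): x_A(t) = 0.6 x_A(t-1) + w_A(t),
    x_B(t) = 0.4 x_B(t-1) + c x_A(t-1) + w_B(t), with unit-variance Gaussian noises."""
    rng = np.random.default_rng(seed)
    xa, xb = np.zeros(T), np.zeros(T)
    for t in range(1, T):
        xa[t] = 0.6 * xa[t - 1] + rng.standard_normal()
        xb[t] = 0.4 * xb[t - 1] + c * xa[t - 1] + rng.standard_normal()
    return xa, xb

def lagged(x, order):
    """Matrix whose row for time t holds x(t-1), ..., x(t-order), for t = order, ..., T-1."""
    T = len(x)
    return np.column_stack([x[order - k: T - k] for k in range(1, order + 1)])

def gaussian_loglik(y, X):
    """Maximized Gaussian log likelihood of y given regressors X, and the ML residual variance."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.mean((y - X @ beta) ** 2)
    return -0.5 * len(y) * (np.log(2 * np.pi * s2) + 1.0), s2

def normalized_llr(xa, xb, order=2):
    """Return (log likelihood ratio / sample size, Gaussian plug-in transfer entropy)."""
    y = xb[order:]
    X0 = lagged(xb, order)                          # H0: past of x_B only
    X1 = np.column_stack([X0, lagged(xa, order)])   # H1: past of x_B and x_A
    ll0, s0 = gaussian_loglik(y, X0)
    ll1, s1 = gaussian_loglik(y, X1)
    return (ll1 - ll0) / len(y), 0.5 * np.log(s0 / s1)

xa, xb = simulate(5000, c=0.5, seed=1)
rate, te = normalized_llr(xa, xb)
print(rate, te)  # the two numbers agree up to rounding
```

Setting `c=0.0` in `simulate` gives data generated under $H_0$, for which the fitted ratio stays close to zero. 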

For testing instantaneous coupling, we will use the following: 

$$
\begin{cases}
H_0 : p_0\big(x_A(i), x_B(i) \mid x_A^{i-1}, x_B^{i-1}\big) = p\big(x_A(i) \mid x_A^{i-1}, x_B^{i-1}\big)\, p\big(x_B(i) \mid x_A^{i-1}, x_B^{i-1}\big), & \forall i \le T \\
H_1 : p_1\big(x_A(i), x_B(i) \mid x_A^{i-1}, x_B^{i-1}\big) = p\big(x_A(i), x_B(i) \mid x_A^{i-1}, x_B^{i-1}\big), & \forall i \le T
\end{cases}
\tag{52}
$$

where under $H_0$, there is no coupling. Then, under $H_1$ and some hypothesis on the data, the normalized log 
likelihood ratio converges almost surely to the information exchange rate $I_\infty(x_A \leftrightarrow x_B)$. 

A related encouraging result due to [70] is the emergence of the directed information in the false-alarm probability 
error rate. Merging the two tests of Eqs. (48) and (52), i.e., testing both for causality and coupling, or neither, the 
test is written as: 

$$
\begin{cases}
H_0 : p_0\big(x_B(i) \mid x_A^{i}, x_B^{i-1}\big) = p\big(x_B(i) \mid x_B^{i-1}\big), & \forall i \le T \\
H_1 : p_1\big(x_B(i) \mid x_A^{i}, x_B^{i-1}\big) = p\big(x_B(i) \mid x_A^{i}, x_B^{i-1}\big), & \forall i \le T
\end{cases}
\tag{53}
$$

Among the tests with a probability of miss $P_M$ that is lower than some positive value $\varepsilon > 0$, the best 
probability of false alarm $P_{FA}$ follows $\exp(-T\, I(x_A \to x_B))$ when $T$ is large. For the case studied here, 
this is the so-called Stein lemma [13]. 

In the multivariate case, there is no such result in the literature. An extension is proposed here. However, this is 
restricted to the case of instantaneously uncoupled time series. Thus, we assume for the end of this subsection that: 

$$
p\big(x_A(i), x_B(i), x_C(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}\big) = \prod_{\alpha = A, B, C} p\big(x_\alpha(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}\big), \quad \forall i \le T \tag{54}
$$

which means that there is no instantaneous exchange of information between the three subsets that form a partition 
of $V$. This assumption has held in most of the recent studies that have applied Granger causality tests. It is, however, 
unrealistic in applications where the dynamics of the processes involved are faster than the sampling period adopted 
(see [30] for a discussion in econometrics). Consider now the problem of testing Granger causality of $A$ on $B$ 
relative to $V$. The binary hypothesis test is given by: 

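$$
\begin{cases}
H_0 : p_0\big(x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}\big) = p\big(x_B(i) \mid x_B^{i-1}, x_C^{i-1}\big), & \forall i \le T \\
H_1 : p_1\big(x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}\big) = p\big(x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}\big), & \forall i \le T
\end{cases}
\tag{55}
$$

The log likelihood ratio reads as: 

$$
l(x_A^T, x_B^T, x_C^T) = \sum_{i=1}^{T} \log \frac{p\big(x_B(i) \mid x_A^{i-1}, x_B^{i-1}, x_C^{i-1}\big)}{p\big(x_B(i) \mid x_B^{i-1}, x_C^{i-1}\big)} \tag{56}
$$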
Again, by assuming that the law of large numbers applies, we can conclude that under $H_1$ 

$$
\frac{1}{T}\, l(x_A^T, x_B^T, x_C^T) \xrightarrow[T \to +\infty]{} T_\infty(x_A \to x_B \| x_C) \quad \text{a.s.} \tag{57}
$$

This means that the causal conditional transfer entropy rate is the limit of the normalized log likelihood ratio as the 
amount of data increases. 
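
The role of the side information $x_C$ can be illustrated with a small simulation. The sketch below assumes a toy 
linear Gaussian model in which a common driver $x_C$ feeds $x_A$ with lag 1 and $x_B$ with lag 2, with no link 
from $x_A$ to $x_B$. The model, the lag `order` and the helper names (`simulate_common_driver`, `gaussian_te`) are 
hypothetical choices made for illustration. The Gaussian plug-in estimate of the normalized log likelihood ratio is 
clearly positive when $x_C$ is ignored, and close to zero once the past of $x_C$ enters both hypotheses as in Eq. (56). 

```python
import numpy as np

def simulate_common_driver(T, seed=0):
    """Toy model (assumed): x_C drives x_A (lag 1) and x_B (lag 2); x_A does not drive x_B."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((3, T))
    xa, xb, xc = np.zeros(T), np.zeros(T), np.zeros(T)
    for t in range(2, T):
        xc[t] = 0.7 * xc[t - 1] + w[0, t]
        xa[t] = 0.8 * xc[t - 1] + w[1, t]
        xb[t] = 0.8 * xc[t - 2] + w[2, t]
    return xa, xb, xc

def lagged(x, order):
    """Matrix whose row for time t holds x(t-1), ..., x(t-order), for t = order, ..., T-1."""
    T = len(x)
    return np.column_stack([x[order - k: T - k] for k in range(1, order + 1)])

def resid_var(y, X):
    """Residual variance of the least-squares prediction of y from X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ beta) ** 2)

def gaussian_te(target, source, side=(), order=2):
    """Gaussian plug-in estimate of the (causally conditioned) transfer entropy, in nats."""
    y = target[order:]
    X0 = np.column_stack([lagged(target, order)] + [lagged(s, order) for s in side])
    X1 = np.column_stack([X0, lagged(source, order)])
    return 0.5 * np.log(resid_var(y, X0) / resid_var(y, X1))

xa, xb, xc = simulate_common_driver(5000, seed=0)
print("x_C ignored    :", gaussian_te(xb, xa))
print("x_C conditioned:", gaussian_te(xb, xa, side=[xc]))
```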

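2) Prediction-based approach in the Gaussian case: Following definition 1 and focusing on the quadratic risk 
$R(e) = E[e^2]$, Geweke introduced the following indices for the study of Gaussian stationary processes [22], [23]: 

$$
F_{x_A \leftrightarrow x_B} = \lim_{n \to +\infty} \frac{R\big(x_B(n) \mid x_B^{n-1}, x_A^{n-1}\big)}{R\big(x_B(n) \mid x_B^{n-1}, x_A^{n}\big)} \tag{58}
$$

$$
F_{x_A \leftrightarrow x_B \| x_C} = \lim_{n \to +\infty} \frac{R\big(x_B(n) \mid x_B^{n-1}, x_A^{n-1}, x_C^{n}\big)}{R\big(x_B(n) \mid x_B^{n-1}, x_A^{n}, x_C^{n}\big)} \tag{59}
$$

$$
F_{x_A \to x_B} = \lim_{n \to +\infty} \frac{R\big(x_B(n) \mid x_B^{n-1}\big)}{R\big(x_B(n) \mid x_B^{n-1}, x_A^{n-1}\big)} \tag{60}
$$

$$
F_{x_A \to x_B \| x_C} = \lim_{n \to +\infty} \frac{R\big(x_B(n) \mid x_B^{n-1}, x_C^{n-1}\big)}{R\big(x_B(n) \mid x_B^{n-1}, x_A^{n-1}, x_C^{n-1}\big)} \tag{61}
$$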



Geweke demonstrated the efficiency of these indices for testing Granger causality and instantaneous coupling 
(bivariate and multivariate cases). Furthermore, in the bivariate case, Geweke showed that: 

$$
F_{x_A \to x_B} + F_{x_B \to x_A} + F_{x_A \leftrightarrow x_B} = I_\infty(x_A; x_B) \tag{62}
$$

where $I_\infty(x_A; x_B)$ is the mutual information rate. This relationship, which was already sketched out in (28), is 
nothing but Eq. (34). Indeed, in the Gaussian case, $F_{x_A \leftrightarrow x_B} = I_\infty(x_A \leftrightarrow x_B)$ and 
$F_{x_A \to x_B} = I_\infty(x_A \to x_B)$ stem from the knowledge that the entropy rate of a Gaussian stationary 
process is the logarithm of the asymptotic power of the one-step-ahead prediction error [13]. Likewise, we can show 
that $F_{x_A \leftrightarrow x_B \| x_C} = I_\infty(x_A \leftrightarrow x_B \| x_C)$ and 
$F_{x_A \to x_B \| x_C} = I_\infty(x_A \to x_B \| x_C)$ hold. 

In the multivariate case, when conditioning on the past of the side information, i.e. $x_C^{n-1}$, in the definition of 
$F_{x_A \leftrightarrow x_B \| x_C}$, a decomposition analogous to Eq. (62) holds, and is exactly that given by Eq. (44). 
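
The bivariate decomposition can be checked numerically on simulated data. The sketch below is an illustration under 
stated assumptions: a toy first-order model with correlated innovations (so that instantaneous coupling is present), 
finite-order least-squares predictors in place of the infinite-past limits, and indices computed as natural logarithms of 
the ratios of residual variances, which is Geweke's usual convention (the constant relating them to information rates 
depends on that convention). The function names are ours. The total dependence is computed independently from the 
determinant of the innovation covariance, and the three indices sum to it. 

```python
import numpy as np

def simulate_var(T, seed=0):
    """Toy model (assumed): x_A drives x_B with lag 1; innovations have correlation 0.5."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(np.array([[1.0, 0.5], [0.5, 1.0]]))
    w = rng.standard_normal((T, 2)) @ chol.T
    xa, xb = np.zeros(T), np.zeros(T)
    for t in range(1, T):
        xa[t] = 0.5 * xa[t - 1] + w[t, 0]
        xb[t] = 0.4 * xb[t - 1] + 0.5 * xa[t - 1] + w[t, 1]
    return xa, xb

def lagged(x, order):
    """Matrix whose row for time t holds x(t-1), ..., x(t-order), for t = order, ..., T-1."""
    T = len(x)
    return np.column_stack([x[order - k: T - k] for k in range(1, order + 1)])

def residuals(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def power(r):
    """Empirical quadratic risk of a residual sequence."""
    return np.mean(r ** 2)

def geweke_bivariate(xa, xb, order=2):
    ya, yb = xa[order:], xb[order:]
    Pa, Pb = lagged(xa, order), lagged(xb, order)
    Pab = np.column_stack([Pa, Pb])
    ea, eb = residuals(ya, Pab), residuals(yb, Pab)   # joint-past innovations
    f_a_to_b = np.log(power(residuals(yb, Pb)) / power(eb))
    f_b_to_a = np.log(power(residuals(ya, Pa)) / power(ea))
    # Instantaneous coupling: add the present sample x_A(n) to the predictors of x_B(n).
    f_inst = np.log(power(eb) / power(residuals(yb, np.column_stack([Pab, ya]))))
    # Total dependence from the innovation covariance matrix.
    E = np.column_stack([ea, eb])
    sigma = E.T @ E / len(ya)
    f_total = np.log(power(residuals(ya, Pa)) * power(residuals(yb, Pb)) / np.linalg.det(sigma))
    return f_a_to_b, f_b_to_a, f_inst, f_total

xa, xb = simulate_var(4000, seed=0)
f_ab, f_ba, f_inst, f_tot = geweke_bivariate(xa, xb)
print(f_ab + f_ba + f_inst, f_tot)  # the two numbers agree up to rounding
```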

3) The model-based approach: In a more general framework, we examine how a model-based approach can be 
used to test for Granger causality, and how directed information comes into play. 

Let us consider a rather general model in which $x_V(t)$ is a multivariate Markovian process that satisfies: 

$$
x_V(t) = f_\theta\big(x_{V,t-k}^{t-1}\big) + w_V(t) \tag{63}
$$

where $f_\theta : \mathbb{R}^{k|V|} \to \mathbb{R}^{|V|}$ is a function belonging to some functional class $\mathcal{F}$, 
and where $w_V$ is a multivariate i.i.d. sequence, the components of which are not necessarily mutually independent. 
Function $f_\theta$ might (or might not) depend on $\theta$, a multidimensional parameter. This general model includes, 
as particular cases, linear multivariate autoregressive with moving average (ARMA) models and nonlinear ARMA 
models; $f_\theta$ can also stand for a function belonging to some reproducing kernel Hilbert space (RKHS), which can 
be estimated from the data [79], [58], [8]. Using 
the partition $A, B, C$, this model can be written equivalently as: 

$$
\begin{cases}
x_A(t) = f_{A,\theta_A}\big(x_{A,t-k}^{t-1}, x_{B,t-k}^{t-1}, x_{C,t-k}^{t-1}\big) + w_A(t) \\
x_B(t) = f_{B,\theta_B}\big(x_{A,t-k}^{t-1}, x_{B,t-k}^{t-1}, x_{C,t-k}^{t-1}\big) + w_B(t) \\
x_C(t) = f_{C,\theta_C}\big(x_{A,t-k}^{t-1}, x_{B,t-k}^{t-1}, x_{C,t-k}^{t-1}\big) + w_C(t)
\end{cases}
\tag{64}
$$

where the functions $f_{\cdot,\theta_\cdot}$ are the corresponding components of $f_\theta$. This relation can be used for 
inference in a parametric setting: the functional form is assumed to be known and the determination of the function is 
replaced by the estimation of the parameters $\theta_{A,B,C}$. This can also be used in a nonparametric setting, in 
which case the function $f$ is searched for in an appropriate functional space, such as an RKHS associated with a 
kernel [79]. 
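
To make the nonparametric setting concrete, the following sketch compares two predictors of $x_B(t)$, one using the 
past of $x_B$ only and one using the past of $x_B$ and $x_A$, with the function searched for in the RKHS of a Gaussian 
kernel. It is not the estimator of the cited studies: plain kernel ridge regression with a fixed bandwidth and ridge 
parameter is swapped in as a simple stand-in, the toy model (a quadratic, hence linearly invisible, influence of $x_A$ 
on $x_B$) is an assumption, and all function names are hypothetical. The ratio of held-out mean square errors plays the 
role of a prediction-based causality index. 

```python
import numpy as np

def simulate_nonlinear(T, seed=0):
    """Toy model (assumed): x_B(t) = 0.3 x_B(t-1) + x_A(t-1)^2 + noise, x_A i.i.d. uniform."""
    rng = np.random.default_rng(seed)
    xa = rng.uniform(-2.0, 2.0, T)
    xb = np.zeros(T)
    for t in range(1, T):
        xb[t] = 0.3 * xb[t - 1] + xa[t - 1] ** 2 + 0.5 * rng.standard_normal()
    return xa, xb

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel between the rows of X and the rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def krr_test_mse(X, y, n_train, ridge=0.1):
    """Kernel ridge regression fitted on the first n_train samples, scored on the remaining ones."""
    Xtr, ytr, Xte, yte = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
    mean = ytr.mean()
    K = gaussian_kernel(Xtr, Xtr)
    alpha = np.linalg.solve(K + ridge * np.eye(n_train), ytr - mean)
    pred = gaussian_kernel(Xte, Xtr) @ alpha + mean
    return np.mean((yte - pred) ** 2)

def nonlinear_granger_ratio(xa, xb, n_train):
    """Held-out risk without the past of x_A divided by the held-out risk with it (one lag)."""
    y = xb[1:]
    X0 = xb[:-1, None]                           # past of x_B only
    X1 = np.column_stack([xb[:-1], xa[:-1]])     # past of x_B and x_A
    return krr_test_mse(X0, y, n_train) / krr_test_mse(X1, y, n_train)

xa, xb = simulate_nonlinear(1000, seed=0)
print(nonlinear_granger_ratio(xa, xb, n_train=700))  # well above 1 when x_A helps predict x_B
```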

In any case, for studying the influence of $x_A$ on $x_B$ relative to $V$, two models are required for $x_B$: one in 
which $x_B$ explicitly depends on $x_A$, and the other one in which $x_B$ does not depend on $x_A$. In the parametric 
setting, the two models can be merged into a single model, in such a way that some components of the parameter 
$\theta_B$ are, or are not, zero, depending on whether $A$ causes $B$ or not. The procedure then consists of testing 
nullity (or not) of these components. In the linear Gaussian case, this leads to the Geweke indices discussed above. In 
the nonlinear (non-Gaussian) case, the Geweke indices can be used to evaluate the prediction in some classes of 
nonlinear models (in the minimum mean square error sense). In this latter case, the decomposition of the mutual 
information, Eq. (62), has no reason to remain valid. 

Another approach relies on directly modeling the probability measures. This approach has been used recently 
to model spiking neurons and to infer Granger causality between several neurons working in the class of generalized 
linear models [74], [44]. Interestingly, the approach has been used either to estimate the directed information [74] 
or to design a likelihood ratio test [26], PBll . Suppose we wish to test whether '$x_A$ Granger causes $x_B$ relative 
to $V$' as a binary hypothesis problem, as in Section IV-B1. Forgetting the problem of instantaneous coupling, the 
problem is then to choose between the hypotheses: 

$$
\begin{cases}
H_0 : p_0\big(x_B(i) \mid x_V^{i-1}\big) = p\big(x_B(i) \mid x_V^{i-1}; \theta_0\big), & \forall i \le T \\
H_1 : p_1\big(x_B(i) \mid x_V^{i-1}\big) = p\big(x_B(i) \mid x_V^{i-1}; \theta_1\big), & \forall i \le T
\end{cases}
\tag{65}
$$
where the existence of causality is entirely reflected in the parameter $\theta$. To be more precise, $\theta_0$ should be 
seen as a restriction of $\theta_1$ when its components linked to $x_A$ are set to zero. As a simple example using the 
model approach discussed above, consider the simple linear Gaussian model 

$$
x_B(t) = \sum_{i>0} \theta_A(i)\, x_A(t-i) + \sum_{i>0} \theta_B(i)\, x_B(t-i) + \sum_{i>0} \theta_C(i)\, x_C(t-i) + w_B(t) \tag{66}
$$

where $w_B(t)$ is an i.i.d. Gaussian sequence, and $\theta_A, \theta_B, \theta_C$ are multivariate impulse responses of 
appropriate dimensions. Define $\theta_1 = (\theta_A, \theta_B, \theta_C)$ and $\theta_0 = (0, \theta_B, \theta_C)$. Testing 
for Granger causality is then equivalent to testing $\theta = \theta_1$; furthermore, the likelihood ratio can be 
implemented due to the Gaussian assumption. The example developed in [74], [44] assumes that the probability that 
neuron $b$ ($b \cup A \cup C = V$) sends a message at time $t$ ($x_b(t) = 1$) to its connected neighbors is given by 
the conditional probability 

$$
\Pr\big(x_b(t) = 1 \mid x_V; \theta\big) = U\Big( \sum_{i>0} \theta_A(i)\, x_A(t-i) + \sum_{i>0} \theta_b(i)\, x_b(t-i) + \sum_{i>0} \theta_{E_b}(i)\, x_{E_b}(t-i) + w(t) \Big)
$$
where $U$ is some decision function, the output of which belongs to $[0, 1]$, $A$ represents the subset of neurons that 
can send information to $b$, and $E_b$ represents external inputs to $b$. Defining this probability for all $b \in V$ 
completely specifies the behavior of the neural network $V$. 
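
A minimal sketch of a likelihood ratio test built on such a model is given below for a single pair of binary spike 
trains. Everything specific in it is an assumption made for illustration and is not taken from the cited studies: the 
decision function $U$ is the logistic function, the memory is limited to one sample, there are no external inputs, and 
the parameters are fitted by Newton-Raphson maximum likelihood. The names `simulate_spikes`, `fit_logistic` and 
`glm_llr` are ours. The restricted model sets the coefficient linked to $x_A$ to zero, so the log likelihood ratio is 
non-negative by construction and grows with the strength of the coupling. 

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_spikes(T, coupling=2.0, seed=0):
    """Toy network (assumed): x_A is i.i.d. Bernoulli(0.3); x_b spikes with probability
    sigmoid(-1.5 + coupling * x_A(t-1) - x_b(t-1))."""
    rng = np.random.default_rng(seed)
    xa = (rng.random(T) < 0.3).astype(float)
    xb = np.zeros(T)
    for t in range(1, T):
        p = sigmoid(-1.5 + coupling * xa[t - 1] - 1.0 * xb[t - 1])
        xb[t] = float(rng.random() < p)
    return xa, xb

def fit_logistic(X, y, n_iter=30):
    """Newton-Raphson maximum likelihood fit; returns the maximized Bernoulli log likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    p = np.clip(sigmoid(X @ beta), 1e-12, 1.0 - 1e-12)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def glm_llr(xa, xb):
    """Log likelihood ratio between the full model and the model without x_A."""
    y = xb[1:]
    ones = np.ones(len(y))
    X0 = np.column_stack([ones, xb[:-1]])            # restricted: theta_A = 0
    X1 = np.column_stack([ones, xb[:-1], xa[:-1]])   # full model
    return fit_logistic(X1, y) - fit_logistic(X0, y)

xa, xb = simulate_spikes(5000, coupling=2.0, seed=0)
print(glm_llr(xa, xb) / (len(xb) - 1))  # normalized log likelihood ratio, in nats per sample
```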

The problem is a composite hypothesis testing problem, in which parameters defining the likelihoods have to be 
estimated. It is known that there is no definitive answer to this problem [53]. An approach that relies on an estimation 
of the parameters using maximum likelihood can be used. Letting $\Omega$ be the space where the parameter $\theta$ is 
searched for and $\Omega_0$ the subspace where $\theta_0$ lives, then the generalized log likelihood ratio test reads: 

$$
l(x_A^T, x_B^T) := \log \frac{\sup_{\theta \in \Omega} p(x_V^T; \theta)}{\sup_{\theta \in \Omega_0} p(x_V^T; \theta)} = \log \frac{p(x_V^T; \hat{\theta}_1^T)}{p(x_V^T; \hat{\theta}_0^T)} \tag{67}
$$

where $\hat{\theta}_i^T$ denotes the maximum likelihood estimator of $\theta$ under hypothesis $i$. In the linear Gaussian 
case, we will recover exactly the measures developed by Geweke. In a more general case, and as illustrated in Section 
IV-B1, as the maximum likelihood estimates are efficient, we can conjecture that the generalized log likelihood ratio 
will converge to the causal conditional transfer entropy rate if sufficiently relevant conditions are imposed on the 
models (e.g., Markov processes with recurrent properties). This approach was described in [26] in the bivariate case. 

V. Conclusions 

Granger causality was developed originally in econometrics, and it is now transdisciplinary, with the literature 
on the subject being widely dispersed. We have tried here to sum up the profound links that exist between Granger 
causality and directed information theory. The key ingredients to build these links are conditional independence 
and the recently introduced causal conditioning. 

We have eluded the important question of how to practically use the definitions and measures presented here. 
Some of the measures can be used and implemented easily, especially in the linear Gaussian case. In a more general 
case, different approaches can be taken. The information-theoretic measures can be estimated, or the prediction can 
be explicitly carried out and the residuals used to assess causality. 

Many studies have been carried out over the last 20 years on the problem of estimation of information-theoretic 
measures. We refer to [47], [10], [68], [46], [27] for information on the different ways to estimate information 
measures. Recent studies into the estimation of entropy and/or information measures are [54], [93], [85]. The recent 
report by [92] extensively details and applies transfer entropy in neuroscience using $k$-nearest neighbors type of 
estimators. Concerning the applications, important reviews include [36], [67], where some of the ideas discussed 
here are also mentioned, and where practicalities such as the use of surrogate data, for example, are extensively 
discussed. Applications for neuroscience are discussed in [38], [25], B4ll . ifTBI . ifTTll . 

Information-theoretic measures of conditional independence based on Kullback divergence were chosen here to 
illustrate the links between Granger causality and (usual) directed information theory. Other types of divergence could 
have been chosen; metrics in probability space could also be useful for assessing conditional independence. 
As an illustration, we refer to the study of Fukumizu and co-workers [21], where conditional independence was 
evaluated using the Hilbert-Schmidt norm of an operator between reproducing kernel Hilbert spaces. The operator 
generalizes the partial covariance between two random vectors given a third one, and is called the conditional 
covariance operator. Furthermore, the Hilbert-Schmidt norm of the conditional covariance operator can be efficiently 
estimated from data. A related approach is also detailed in [82]. 

Many important directions can be followed. Causality between nonstationary processes has rarely been considered 
(see however [92] for an ad-hoc approach in neuroscience). A very promising methodology is to adopt a graphical 
modeling way of thinking. The result of [18] on the structural properties of Markov-Granger causality graphs can 
be used to identify such graphs from real datasets. A first step in this direction was proposed by [73]. Assuming that 
the network under study is a network of sparsely connected nodes and that some Markov properties hold, efficient 
estimation procedures can be designed, as is the case in usual graphical modeling. 